Sidra Data Platform (version 2020.R3: Hallowed Honeycrisp)¶
released on Nov 4th, 2020
Welcome to the November 2020 release of the Sidra Data Platform. This page documents all the new features, enhancements and visible changes included in the new version 2020.R3: Hallowed Honeycrisp
Sidra 2020.R3 Release Overview¶
In this release we continue building upon the observability and operability of the Sidra platform, as well as growing the interoperability with the Data Products ecosystem to unlock multiple use cases regarding Analytics and Knowledge Mining.
While we are working on the future Web UI improvements that will greatly contribute to speed up operational processes of setting up data connectors and Data Products, for this release we wanted to continue delivering on improvements and automation on the deployment process, as well as on enabling further transparency on the data intake performance and executions.
This version also represents a step forward in our strategy to consolidate and grow capabilities of our Knowledge Store, related to the ingestion and AI insights generation of unstructured data sources (e.g. PDF, DOCX files). On one hand, we have made improvements on the integration with Azure Cognitive Search services. On the other hand, we have delivered a flagship Data Product, called Contract Analysis, which sets the foundation of Knowledge Store and showcases its power in a specific but easily customizable and extensible business case.
What's new in Sidra 2020.R3¶
Here is the list of the most relevant improvements made to the platform as part of this release:
- New Sidra deployment tool
- Operational DW and Reports improvements
- New Web UI section for Authorization Management
- Improvements to the Azure Cognitive Search pipelines
- Metadata inference API for CSV files
- New Application Insights Custom Events
- Improvements in Change Tracking for incremental load
- Updated Batch Account API version
- Updated Databricks to latest 7.2 runtime version
- Unified SignalR hubs for generic handling of notifications
- Automatic creation of staging tables for SQL client pipeline
New Sidra deployment tool¶
We have added a new installation command line interface (CLI) tool, which simplifies and improves several aspects of the previous installation and deployment process of Sidra across environments. The tool allows to package and automate deployment and installation tasks:
This CLI tool has been designed to cover for generic services of Sidra across client installations, including separate and differentiated stages for optional services. The available services are retrieved by the back-office service of Sidra to guide the installation process.
As new environments are configured, new stages are created for that environment in the deployment pipeline. The tool transparently manages the creation of different settings, including managing sensitive configurations through Key Vault.
The CLI includes an option to deploy a pre-release version (preview), which will create the needed directories, deploy Sidra core and push to Git:
Update deployments have also been incorporated as an option in the Sidra CLI tool:
Once all params are configured for the update, or taking the existing ones by default, the update process is executed automatically:
Operational DW and Reports improvements¶
An important improvement coming in this release is the addition of further data and catalogue insights via Power BI reports on the Sidra DW. Several built-in Power BI reports have been made available to display performance indicators on the data intake pipelines, including trends and insights related to entities, volume of data stored, validation errors and pipeline executions status and execution lengths.
These Power BI reports are called to complement some of the operative information already available through the Sidra Web UI Dashboard and Data Catalog sections.
The Power BI DW reports aim at achieving a dynamic, interactive, and detailed view of intake process related information, according to the configurable selected time ranges. This way, we provide a consolidated point of view of the different operative data sources in Sidra, like Core and Log databases.
The DW Power BI reports for Sidra contain the following pre-configured dashboards:
Data Intake Report:
The Data Intake Report displays, for the selected time period in the slicer, different graphs and indicator cards regarding Data Intake volumes: total volume of data stored, percentage of validation, evolution of stored volumes and validation errors over time. This report also provides info on the total assets and number of rows ingested.
Pipeline Executions Report:
This report shows all information related to the overall pipeline executions and performance during the selected time period. In addition to key figures, some trend graphs display the evolution of executions of the pipelines and their duration vs the average pipeline execution duration.
From here, users can drill through a specific pipeline to see more detailed information about the different Pipeline executions.
Detailed Pipeline Executions Report:
The Detailed Pipeline Executions Report gives information about the number of executions of the selected Pipeline by date and duration, as well as related activities. The dashboard contains a Pipeline RunList with details on each execution of the pipeline:
By selecting one of the pipeline executions from the list, users can see the detail including deeper information about that selected execution.
This report presents information regarding the Providers. For the selected period of time, information like the total stored volume, active Entities, total number of Assets and Rows is dynamically provided.
Stored Volume Analysis Report:
A decomposition tree is displayed with a view on the stored volume of data, broken down by different hierarchical categories – DSU, Provider, Entity and Asset:
Finally, this report is focused on providing insights into Anomalies Detection at ingestion time. Insights like the rate of anomalies per Asset, distribution of anomalies by type, Provider and Anomaly trends by date are provided by this report.
New Web UI section for Authorization Management¶
As part of our ongoing effort in offering configuration and management capabilities of Sidra through an easy-to-use web interface, we have developed a new section in Sidra Web UI for displaying and managing Balea Authorisation Framework.
The system now allows to see all applications registered under Sidra Core and allows to assign and enable roles to users, including mappings, delegated credentials and bespoke permission sets on the underlying resources (DSU, Provider and entity scoped authorization).
According to the different perspectives included in the Balea Authorisation framework, the Web UI offers different UI views to manage all entities and relationships of this framework:
Balea Users View:
From the Balea Users View, Sidra Web users can navigate through all the registered subjects (users) in the Balea Applications. A subject in Balea identifies a user -an individual- or a client -a software system- in the Authorization system.
From each of the User items, it is possible to add new delegations for specific applications, including start and end date of such delegation.
Balea User Detail View:
The Balea User Detail view allows to configure, for each of the Balea subjects, and for each of the application, the following configuration:
- See the list of roles that have been enabled for each application
- Assign/unassign roles to the user for each application
- Edit access level permissions (of the user to the underlying resources (Provider, Entity). A permission is the ability to perform some specific operations in Sidra – read, write, and delete- on the metadata of the resource.
Balea Application View:
From the Balea Applications View, Sidra Web users can navigate through all the registered applications with the Balea authorisation framework. Balea supports handling the authorisation for several applications, so the same subject can have a different set of permissions in each application.
Balea Application Detail View:
A Balea Application can have attached different categories of objects within the Balea Authorisation framework.
From the Roles menu, the Application detail view allows to perform several actions related to roles management for a specific Balea application: add/edit/delete roles, configure mappings for these roles, as well as enable/disable roles for the specific application. Mappings allow to implement associations between the roles that come from the authentication system and the roles in the Balea Authorization system.
The Permissions menu allows to list and search specific configured permissions that the app has attached (by means of the implicit association between Roles and Permissions in the Balea Authorization model).
The Delegations menu allows to list and configure permissions delegation between users for a specific Balea application. Delegating a role to second user means this second user can assume a set of permissions to access to the application on behalf of the first user.
Balea API Keys View:
The Balea API Keys view allows to create API keys by configuring their expiration. This View provides the list of API Keys created into the system with their expiration status and allows to delete them as needed.
Improvements to the Azure Cognitive Search pipelines¶
Sidra leverages Azure Cognitive Search pipelines as part of the Knowledge Store capability to process and analyse unstructured types of documents. We have been working on a pipeline template that allows to ingest documents as Assets in Sidra Data Platform, and to create pipelines of Azure and bespoke Plain Concepts knowledge ML tasks, including Optic Character Recognition (OCR) and key phrases extraction over PDF files.
Azure Cognitive Search includes the concept of Skillset, which incorporate consumable AI from Cognitive Services, such as image and text analysis. Therefore, the pipeline can now be easily adapted to accommodate different processing needs by adapting the skillsets configuration or even creating custom skillsets.
During this release, also some operational and usability issues regarding the indexer executions in different scenarios have been improved as part of this feature.
Metadata inference API for CSV files¶
We understood that metadata Asset inference is a key piece of automating and speeding up the process for new entities addition in Sidra. Therefore, we have worked on extending the Assets schema inference by adding the capability to infer CSV file metadata. The Sidra API has been extended to include a new method that receives a Stream of CSV sources and returns the metadata inferred via Microsoft.Data.Analysis package.
New Application Insights Custom Events¶
With this feature we continue building upon the existing observability of the platform and its usage. Application Insights allows to track Web API standard events, like sessions, page views, etc, as well as integrate specific system and API events.
System events captured and centralized include different Identity Server actions, as well as specific data intake API calls.
Additionally, as part of the deployment and operating model defined in the Data Products architecture, we have added with this feature yet another point of standardization with Sidra Core: now Data Products templates will use the Application Insights that is deployed in Core, thus removing the need to deploy any Application Insights instance in the Data Product.
Improvements in Change Tracking for incremental load¶
An option for signalling when to reload all the changes (min valid version) is now useful to control scenarios of expiration of data retention period. Sidra ingestion pipelines now use CHANGE_TRACKING_MIN_VALID_VERSION in SQL to control when Change Tracking can be used to load the changes, and when we need to overwrite the entire table. The improvements incremental load also incorporate support for non change tracking loading logic in SQL pipeline. This has been done by adding a new option to use an incremental query based in parameterization.
In addition, Sidra now supports delete operations in Change Tracking incremental load. The delete support has been incorporated to Transfer Query scripts and we can now support logical deletion through a new metadata field called Sidra_IsRemoved.
Updated Batch Account API version¶
As the API version used in ARM templates to deploy Batch Account is being deprecated, the latest version has been updated in the ARM templates.
Updated Databricks to latest 7.2 runtime version¶
We have updated to latest Databricks 7.2 runtime version, where Apache Spark is upgraded to version 3.0.0.
Unified SignalR hubs for generic handling of notifications¶
In this version, we invested effort on unifying the different types of notifications in the system (Schema detection, Anomaly detection, Webjobs). Notification Types have been added to make this unified mechanism extensible for future notification use cases.
Automatic creation of staging tables for SQL client pipeline¶
While we continue making progress with our major overhaul of Data Products Management model for next version, we have improved the operational overhead of creating staging tables for the SQL Data Product pipelines. From this version we have removed the need to manually create the staging tables needed for such Data Products, so that automatically these tables are created based on the existing metadata.
Issues fixed in Sidra 2020.R3¶
In addition to all the new features of this release, we were able to fix a considerable number of issues, including the following:
- Fixed an issue where the BatchExtractAndIntake pipeline template query could fail due to a wrongly generated query. #97094
- Fixed an issue where the KeyVault helper would fail when the URI ends with a backslash character. #96339
- Improvements to the population of Identity Server's ApiResource and ApiScope when creating a client. #96235
- Fixed an error in which the Transfer Query would insert null records on rows with validation errors when using version 7.x of Databricks Runtime. #96454
- Fixed an error in which the "create_sqlserver_datacatalog_table" procedure would fail due to the JDBC connector not finding a suitable driver. #98142
- Moved the AddUser method from LogContext to the Persistance.Log repository. #97946
- Stopped deploying a local Application Insights in Data Products, so now all Data Products register the activity in Core's Application Insights . #96325
- Fixed an issue in which the AdditionalProperties is not correctly generated during inference process when invalid characters are found. #98209
- Fixed some errors on recent database migrations. #98527
- Fixed and error when generating a token to upload an asset to the landing zone using the date in the path. #97740
- Fixed a condition where Get-AzApplicationInsights was not found when deploying a Data Product by changing to AzureRm command instead. #98228
- Fixed an issue where the authentication request against identity server would fail with a BadRequest response if the client_secret contains special characters. #98222
- Fixed an issue where an ADF activity will fail if the extract inside the ForEach fails, causing the subsequent activities not to be executed as well. #98696
- Fixed an issue where the Identity Server Core client Id returns an incorrectly formatted URL. #97130
- Fixed an issue where entities with certain special characters would fail to load. #98790
- Changed the behaviour when a new application is created so it uses existing roles in Sidra, instead of creating new roles for the App. #96327
- Fixed an issue in which the DatabaseBuilder returns a NPE when ran in localhost. #98733
- Removed the residual Azure Tables support from the ClauseExtraction pipeline template, eliminating issues with large document intake. #97251
- Fixed an error where the application register would fail when granting permissions. #98263
- Changed the DataFactoryMetricLogs process to avoid concurrent executions. #98702
- Fixed an issue where DataFactoryMetrixLog task would not finish. #99043
- Fixed an issue by which the Sidra CLI Update method would fail due to missing IConsole service. #99232
- Fixed an issue where the application registering was failing because endpoint response was missing required output data. #100526
- Fixed an issue where a change in the data type in a field was affecting the view DataLakeVolume. #100927
- Fixed an issue happening when deploying Sidra from scratch in some circumstances, which was preventing DalatakeDeploy from executing properly. #100120
- Changed the behavior of client deployment script for no longer failing if no client database is created. #99982
- Solved an error related to AzureSearchIndexingService background job when using secure context. #100996
- Fixed conversion errors being thrown on view DW DataLakeVolume by changing the data type in a field of the underling table. #100927
- Solved an issue with one WebAPI endpoint, which was not properly checking DatabaseName naming rules. #97855
- Fixed some breaking changes in read_json method from Pandas 1.1.2 that were introduced by Azure Databricks upgrade. #99979
- Solved an issue where in specific scenarios, the pipeline information was not stored in the log table DataFactoryPipelineRun. #100100
- Fixed the retry behavior of GetStorageKey function when requests are timing out. #100629
- Fixed an issue where the deployment always installed the default size (S) in 1.7.1 version. #100728
- Fixed an eror that caused, in some deployment script executions, to take wrong priorities to get the information of the installation size in 1.8 version. #100728
- Fixed an issue which, under certain circumstances and when using optional entities to extract, was causing the Data Product extraction pipeline to fail. #97890
- Fixed transfer queries problems when the entity had an escaped character. #99003
- Now Python SDK Authentication module gets token using dictionary in the POST request to guarantee that data is form-encoded and prevent errors with special characters. #98260
- Fixed an issue in GIT operations performed by specific Sidra CLI client installations by using Azure DevOps Personal Access Tokens. #100834
- Fixed an issue where, in some specific scenarios, the KeyVault policy set was failing due to missing configuration. #101553
- Fixed an error on authentication for CreateTable notebook when Client Secret contains some special characters. #98222
- Fixed an issue where, under certain conditions, the databricks linked service was not created correctly. #102347
- Fixed an issue when a query filter was specified as parameters in the API to retrieve the providers. #102741
- Solved different issues rooted in wrong type setting of idProvider in StorageVolumeProvider. #102284
- Solved potential issues when getting tokens for uploading to LandingZone by limiting to Admin user only. #102310
- Fixed issue when trying to parse KeyVault placeholders when pipeline definition JSON is not split into multiple lines. #103521
- Fixed an issue that, under special circumstances, could generate invalid internal credentials to encrypt the data when using encryption pipelines. #103433
Breaking Changes in Sidra 2020.R3¶
Transfer query activities changed from "RunScript" to "RunNotebook"¶
This release includes a change in the execution of transfer query activities. From this release, the logic execution of such activities is performed via Jupyter Notebook instead of Python scripts. This change was required to address some limitations in monitoring and debugging of failed transfer queries.
After upgrading to this release, all transfer queries need to be recreated so they are uploaded as notebooks to Databricks. The location will be in the Databricks workspace, under Shared folder.
Automatically create staging tables¶
This release includes a change for automatically creating staging tables for SQL Data Products pipeline.
The required actions will be needed only for Data Products with database for extracting the content from the DSU and using the pipeline templates provided by Sidra. These required actions consist of: - In your VS solution, remove all staging tables from the database project. These tables will be dropped and created by Sidra each time for the selected entity. - In your orchestrator procedure, alter staging tables to provide any keys or indexes that were in place before to ensure that the query used to read the data is optimized.
It is with great sadness that we announce here that there will be no more releases of Sidra in 2020.
Having said that, we are making good progress internally working on the foundations of the future Sidra Web UI, coming in the first release of 2021. For this next release, we are planning important graphical configuration features to be released, which represent an important step forward in the operability and management of the platform. The new capabilities of Sidra Web will be mostly centered around two big themes:
- Intake connectors step by step graphical user interface, starting with most used connectors across our client base.
- Data Products easy management and creation interface. This leverages the foundations of the new Data Product model and internal API.
On top of this, we are working on further automating Provider Import/Export in order to allow an easy migration along Sidra environments.
We would love to hear from you! For issues, contact us at [email protected]. You can make a product suggestion, report an issue, ask questions, find answers, and propose new features.