Sidra Data Platform (version 2022.R1: Joyous Junaluska)¶
released on Mar 01, 2022
During this latest release cycle, we have continued to put a big focus on the data intake plugin model and its key building blocks, to allow for a fully agile release cycle of plugins, decoupled as much as possible from Sidra release cycles.
This plugin model has been improved with the incorporation of the concept of Data Intake Process, described below.
This release comes with many improvements to the Sidra Management Web UI, such as the addition of tags to Attributes and the ability to assign permissions to Sidra roles from the web.
On top of that, deployment and operability improvements have been made, such as a reworked integration of logs and notifications with Application Insights, support for feature toggling, and CLI enhancements.
The key features around plugins continue paving the way towards a full lifecycle for the configuration of processes related to data ingestion.
Sidra 2022.R1 Release Overview¶
This release comes with very significant features and improvements, some of which represent a big architectural change that is the cornerstone of the new Sidra:
- Data Intake Process model with plugins
- SharePoint online library connector plugin
- New Databricks Data Product template
- Support for Excel data intake
- Data Catalog and web improvements
- Support for multiple pipelines in Azure Search
- New Data Product Sync behavior
- Support for custom naming conventions in staging tables in Data Products
- Updated Databricks to latest 10.3 runtime version
- Logs and Notifications in AppInsights
- Improvements in plugin type mappings
- CLI improvements
- Esquio integration for feature toggling
- New plugin type for pipeline templates installation
- Documentation improvements
What's new in Sidra 2022.R1¶
Data Intake Process model with plugins¶
In the course of this latest release, we have continued to improve the data intake plugin model, setting the cornerstone for a full plugin management life-cycle for our users. Once this full life-cycle is completed, it will mainly allow:
- Re-configuring data intake processes.
- Upgrading pipelines to new versions in an automated and controlled manner.
It is also worth highlighting a new concept, the Data Intake Process, which bridges the two worlds of plugin execution and the underlying data intake infrastructure creation and operations.
Users are now able to access a new section of the web called Data Intake Processes to see the list of configured Data Intake Processes in any given Sidra installation.
In a future release, from this list, users will be able to invoke actions such as update (the ability to re-execute the plugin with new configuration parameters), or upgrade to a new version of the plugin.
That brings us to the Connector Plugin concept. Every connector plugin execution creates a Data Intake Process object in Sidra, which, in turn, creates the corresponding data integration infrastructure elements, data governance structures and metadata. The connector plugins available in this latest release of Sidra are the Azure SQL connector, the SQL Server connector and the SharePoint online library connector.
A new automated migration process has been developed to convert the existing pipelines to the new Data Intake Model. This will be executed by our support team as part of the Sidra update process in this release 2022.R1.
Please check the Breaking Changes section for more information about the Data Intake Process migrations.
SharePoint online library connector plugin¶
A new connector plugin for extracting and ingesting data from SharePoint online library data sources is released as part of this Sidra version.
This new connector plugin comes with several configuration options which cover different scenarios depending on the destination container and its relative file types.
New Databricks Data Product template¶
This release incorporates a new Databricks Data Product template as an additional extension to the regular Basic SQL Data Product.
This template will cover different advanced scenarios where the user may require custom data query logic on the raw data lake or advanced data aggregation and dependencies logic.
Support for Excel data intake¶
A new process for data intake from Excel files has been implemented in Sidra, covering the multiple scenarios in which users need to consume their data. Thanks to this process, it is now possible:
- To automate the configuration of data tables in complex Excel files.
- To accelerate, through automation, the definition of the Sidra metadata for these tables (Entities).
- To automate the data ingestion to the Data Lake with a minimal setup.
Thus, our Excel ingestion process is capable of inferring the schema of complex-formatted Excel files in order to define how Entities and views are going to be created, as well as how the data is going to be extracted.
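As a rough illustration of this kind of schema inference (not Sidra's actual algorithm), the sketch below scans a grid of cell values, as read from an Excel sheet, for the first all-text row, treats it as the header, and then infers one type per column from the rows below it:

```python
def infer_table(rows):
    """Infer (header, column_types) from a grid of cell values as read
    from an Excel sheet. Illustrative only: leading title/blank rows are
    skipped; the header is the first row where every cell is a
    non-empty string; types are inferred from the data rows below."""
    def is_header(row):
        return all(isinstance(c, str) and c.strip() for c in row)

    for i, row in enumerate(rows):
        if row and is_header(row):
            header = [c.strip() for c in row]
            break
    else:
        raise ValueError("no header row found")

    types = []
    for col in range(len(header)):
        # Collect the non-empty values for this column below the header.
        values = [r[col] for r in rows[i + 1:]
                  if col < len(r) and r[col] is not None]
        if values and all(isinstance(v, bool) for v in values):
            types.append("BOOLEAN")
        elif values and all(isinstance(v, int) and not isinstance(v, bool)
                            for v in values):
            types.append("INT")
        elif values and all(isinstance(v, (int, float))
                            and not isinstance(v, bool) for v in values):
            types.append("FLOAT")
        else:
            types.append("STRING")
    return header, types
```

For example, a sheet with a title row above the real table would resolve to the `Region`/`Units`/`Revenue` header and `STRING`/`INT`/`FLOAT` column types.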
Data Catalog and web improvements¶
Sidra Web UI incorporates in this release multiple functionality upgrades, redesigns and performance improvements.
Below is a list with all the improvements included as part of this feature:
Assign permissions to roles
Users can now edit the set of permissions included in each role, as defined for each application in Sidra. With this, the full Balea authorization framework can now be managed from the web.
The notifications page has been redesigned to improve performance when displaying notifications and marking them as read. The page now limits the number of notifications displayed, providing a clearer view and better performance.
Add tags to Attributes
With this release it is now possible to add tags to Attributes in the Data Catalog. This completes the ability to codify and categorize assets in the Sidra Data Catalog at all three levels of data ingestion metadata: Providers, Entities and Attributes.
New Entities list
The Data Catalog view for Entity detail has been redesigned to improve the navigation through Entities and convey the most important information in an optimized way.
Support for multiple pipelines in Azure Search¶
Before this release, only one pipeline could be executed per defined Entity in the binary file ingestion flow. File indexing requires processing a set of files in parallel, so that they are indexed using different skillsets or indexing modules.
A data ingestion pipeline for Azure Search or binary file ingestion basically defines a skillset or indexing workflow to be executed over the files belonging to an Entity.
To support multiple simultaneous indexing workflows for the same Entity, the binary file ingestion module has been adapted to support a one-to-many relationship between Entities and data indexing pipelines. DSU tables and intermediate artifacts in Sidra, like the Knowledge Store, have also been adapted to support this multiple-pipelines scenario.
This support for multiple pipelines only applies to binary file ingestion.
For other types of data intake, like ingestion from structured or semi-structured data sources, the one-to-one relationship between Entity and data ingestion pipeline remains.
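The new association rule can be pictured with the sketch below; the class and method names are hypothetical, purely to illustrate the one-to-many relationship for binary ingestion versus the one-to-one relationship kept for other intake types:

```python
from collections import defaultdict

class EntityPipelineAssociation:
    """Illustrative model of the Entity/pipeline association after
    2022.R1: binary file ingestion allows many indexing pipelines per
    Entity, while other intake types keep a one-to-one relationship.
    (Names are assumptions, not Sidra's actual API.)"""

    def __init__(self):
        self._pipelines = defaultdict(list)

    def associate(self, entity, pipeline, binary_ingestion):
        # Structured/semi-structured intake keeps the 1:1 constraint.
        if not binary_ingestion and self._pipelines[entity]:
            raise ValueError(
                f"Entity '{entity}' already has a pipeline; only binary "
                "file ingestion supports multiple pipelines per Entity")
        self._pipelines[entity].append(pipeline)

    def pipelines_for(self, entity):
        return list(self._pipelines[entity])
```

Under this model, a binary-ingestion Entity can accumulate several indexing pipelines (one per skillset), while attempting to attach a second pipeline to a structured-intake Entity fails.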
New Data Product Sync behavior¶
A new Sync behavior has been incorporated through a Sync webjob, which is responsible for:
- The metadata synchronization between the DSU and the Data Products.
- Triggering the actual client data orchestration pipelines.
This new behavior is more robust against possible errors during the Data Intake Process, where some Entities could be loaded while others are not. With this release of Sidra, when this scenario happens, the Data Product synchronization is guaranteed to include all intermediate successful Assets: all Assets due to be ingested in the previous scheduled loads, up to the most recent day for which there are available Assets for every mandatory Entity.
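The cut-off rule described above can be sketched as follows; the function name and data shapes are illustrative, not Sidra's actual implementation:

```python
from datetime import date

def sync_cutoff(asset_dates_by_entity, mandatory_entities):
    """Return the most recent day for which every mandatory Entity has
    at least one successfully loaded Asset, or None if there is no such
    day. `asset_dates_by_entity` maps entity name -> set of dates with
    available Assets (hypothetical shape, for illustration only)."""
    per_entity = [asset_dates_by_entity.get(e, set())
                  for e in mandatory_entities]
    if not per_entity:
        return None
    # Only days covered by *every* mandatory Entity are eligible.
    common = set.intersection(*per_entity)
    return max(common) if common else None
```

Assets dated on or before the returned day would then be included in the Data Product synchronization; later, partially loaded days would wait for the next sync.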
Support for custom naming conventions in staging tables in Data Products¶
The API now contains parameters to enable a custom table name prefix for the Databricks tables and the staging tables. This setting is disabled by default. If enabled, you can specify your own table name prefix; if the prefix is empty, the resulting name will be just the name of the table at source.
The relevant setting is `CustomTableNamePrefix`, which needs to be passed as a parameter when calling the metadata extraction pipeline.
Please also read the Breaking Changes section for more information on the different implementation alternatives.
Updated Databricks to latest 10.3 runtime version¶
We have updated to the latest Databricks 10.3 runtime version, where Apache Spark is upgraded to version 3.2.1.
Logs and Notifications in AppInsights¶
Sidra includes a broad range of possibilities to monitor and track operational processes, especially during data intake: logs, notifications, KPIs in Sidra web and the operational Power BI dashboard. In this new version, the product leverages the powerful, centralized monitoring capabilities of Azure Application Insights to improve the operability of the platform. With this release, all logs and notifications are sent to Application Insights, on top of the Azure Data Factory metrics. This enables the following:
Full access to metrics.
Dashboard and alerts building through workbooks.
Log Analytics queries.
Sidra now allows custom queries to be built using and combining these data sources to spot specific metrics relevant to users, at business or operational level.
Improvements in plugin type mappings¶
In this release, we continue building upon the existing plugins and data intake pipeline templates, as well as improving and adding new type translations between sources and Sidra components. More specifically, Oracle and SQL Server type mappings have been thoroughly reviewed and extended to cover more scenarios.
CLI improvements¶
As part of our ongoing effort on the robustness and usability of the CLI tool, this release includes the following enhancements:
The CLI tool now checks that the Git, Azure CLI and .NET requirements are installed before continuing. This contributes to a cleaner installation and reduces failed attempts due to missing prerequisites.
Additionally, more descriptive error messages have been added to different commands, to guide users through the process and reduce troubleshooting time when some prerequisites or permissions are not in place.
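Conceptually, this kind of pre-flight check behaves like the sketch below; the executable names and messages are assumptions, not the CLI's actual code:

```python
import shutil

# Tools a pre-flight check might look for (assumed executable names:
# Git, Azure CLI, .NET SDK).
REQUIRED_TOOLS = ("git", "az", "dotnet")

def missing_prerequisites(tools=REQUIRED_TOOLS):
    """Return the required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

def check_prerequisites(tools=REQUIRED_TOOLS):
    """Fail fast with a descriptive message before doing any work."""
    missing = missing_prerequisites(tools)
    if missing:
        raise RuntimeError(
            "Missing prerequisites: " + ", ".join(missing)
            + ". Install them before running the installation.")
```

Failing fast with a named list of missing tools is what turns a cryptic mid-installation error into an actionable message, which is the point of this enhancement.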
Feature toggling using Esquio¶
Sidra now uses Esquio to enable feature toggling scenarios. Esquio is a feature server, integrated with the Sidra backoffice, that allows defining features and applying different release/visibility criteria to them based on a broad number of conditions. Further evolutions of Esquio will be explored to support complex access logic in the Sidra Web UI and Sidra API.
New plugin type for pipeline templates installation¶
Sidra architecture is continuously evolving to scale up and out according to growing roadmap requirements and customer needs. Modularity is a key principle of the Sidra architecture, and the plugin model is one of its most important examples. In order to allow for a more decoupled evolution of data intake from the rest of the platform components, we have re-used the plugin model for a new use case. A new type of plugin, called `pipelinedeployment`, has been added to support versioning and easy installation of data intake pipelines, even if they are not implemented by a full plugin on the web. This model is useful for versioning and installing bespoke data intake pipelines, as decoupled as possible from the Sidra release life-cycle.
Documentation improvements¶
To keep improving transparency and support for our users, the Sidra documentation website includes several new guides in this release:
- New CLI documentation pages.
- New Data Product pages including Data Product deployment.
- Added tutorial for data intake from Excel files.
- Added tutorial for Schema evolution and data intake from CSV files.
Issues fixed in Sidra 2022.R1¶
Sidra's team is constantly working on improving product stability; the following list includes the most relevant bugs fixed and improvements developed as part of this new release:
- Fixed an issue when retrieving runId from Databricks, whereby a change in the runId numeration was triggering an error in the Data Product pipelines. #137836
- Fixed an issue where the dataset template for a specific activity was passing a wrong reference of the dataset template, therefore the default values were incorrect. #137369
- Solved an issue in pipeline templates by removing the index on `ItemId`, adding a composite index, and forcing the dataset (source) template / pipeline template association to be updated only when the `ItemId`, `IdDataset` (source) and `PipelineId` exist. #137446
- Fixed 'time' translation cast to string in parquet files in the Azure SQL connector plugin. #134759
- Solved an issue when synchronizing data in some Data Product templates when there are no assets in Core DB. #137402
- Fixed an issue in Transfer Query, which made ingestion fail when including `IdSourceItem` to create delta inserts. #137193
- Fixed an issue in CLI tool when executing New-AzADAppCredential for Automation Account. #134290
- Fixed an issue in connector plugins that made notifications fail under specific circumstances. #134944
- Solved an issue in Persistence Client seed project in Data Products, which did not create the ADF dataset automatically. #134955
- Fixed a problem in connector plugins where the validation of existing Provider name was not informing the user correctly. #133847
- Fixed an issue in transfer query script that was preventing the data preview tables from being generated due to missing import packages. #134385
- Solved an issue with DATE and too-high DECIMAL precision type translations in Oracle data sources. #134529
- Amended an issue in Sidra CLI that was not setting up properly the Redirect URIs and Identifier URI when creating AAD applications. #135136
- Fixed an issue in Identity Server application that was preventing Sidra plugins from being installed correctly. #135176
- Added a final status setup for an Asset that does not contain data. #131351
- Fixed some incomplete data configuration persistence while deploying Sidra. #133095
- Solved an issue where installation verification of Azure CLI stopped execution if there was a new available Azure CLI version. #134623
- Corrected some type translation errors when attempting type translation for source types with Oracle databases. #131753
- Fixed a cosmetic issue in alignment of elements in the plugins' gallery. #135716
- Solved a problem related to internal users in IS that was causing login and password recovery issues for internal user and password settings. #165506
- Fixed an issue in Basic SQL Data Product where the template deployment was not done correctly in dotnet 6. #133201
- Corrected an issue with big images in the Data Product detail management page in Sidra web. #131981
- Fixed an issue in intake pipelines with incremental loads configured without using change tracking. #135551
- Removed unused parameters in the Data Product template that were throwing errors at deployment. #127096
- Fixed an issue related with compatibility with SQLServer versions lower than 2016. #133559
- Solved an issue where `DataFactoryManager` was deleting resources in the wrong order in Data Products. #131863
- Fixed an issue where the `GetEntity` detail method was not returning the correct imageURL for the Entity. #134300
- Amended an issue where the `GetEntitiesToExport` activity was not using the custom Staging table name defined in
- Corrected an issue where the `CopyEntityToStaging` Data Product pipeline was not using the appropriate dataset. #134372
- Fixed an issue where Data Products registration failed when using implicit client authentication. #134236
- Fixed an error with the stored procedure `GrantClientIentityAccess` under specific circumstances. #133950
- Fixed an issue with metadata extraction pipeline compatibility with SQL Server 2014. #132969
- Solved a problem where the `BatchExtractAndIntakeSQLPictures` pipeline was using old database table names. #133172
- Fixed an issue where the `BatchExtractAndIntakeSQLMultiDBCompatibilityLevel100` dataset used for an activity was hard-coded; it now takes its value from the `extractSqlServerDataToBlobInputDataset` parameter. #134522
- Solved a problem where the `BatchExtractAndIntakeSQLMultiDBCompatibilityLevel100` pipeline was not pointing to the proper activity output to take the CurrentChangeTrackingValue. #134530
- Fixed an issue where long tags for Entities were failing to be added to the Data Catalog in Sidra Web. #136059
- Fixed an issue in the DSU deployment stage where some resources failed to be created under certain circumstances. #134333
- Solved a problem where, for newly created AAD applications, granting permissions failed because the application was still "locked"; retries have been added. #129639
- Fixed an issue where `EntityDTO` was not being correctly validated, so the table model and the validation rule mismatched. #131559
- Amended an issue where the Logs cleanup job was failing to execute. #130621
- Fixed an issue where the stored procedure to create the staging tables in the Data Product was using the Provider name instead of the database name. #132460
- Solved an issue in Data Products queries to the DSU when column names with spaces and other chars were not correctly sanitized. #132556
- Added a fix to allow customize database parameters in a Data Product based on SQL. #132519
- Solved an issue where the functionality to mark notifications as read from Sidra Web was not working correctly for a very big number of notifications. #132870
- Fixed an issue where the call `[POST] /api/metadata/providers/raw` was not filling all the required values. #133168
- Solved an issue when granting permissions to Data Products, due to a length limitation. #133149
- Fixed an issue in an intake pipeline template that prevented Assets from being created when there was no data from the source. #131662
- Solved an issue when deploying on a VNET related to
- Fixed an issue with `NUMERIC` type translations in SQL plugins. #130502
- Amended an issue where the KPI cards for the dashboard in Sidra web were not displaying correctly when more than one DSU was installed. #121480
- Fixed an issue in User permissions section in Sidra web due to parent security path not being mandatory. #130863
- Fixed an issue with incorrect type translation with DB2 with timestamp values. #130820
- Added a fix in transfer queries script to prevent failure caused by parallelization in initial stages of data load. #124109
- Applied performance optimizations when synchronizing metadata in Data Products. #128545
- Fixed a performance issue in the notifications page in Sidra web when there was a big number of notifications to be displayed. #131604
- Added `Az.Automation` as part of the prerequisites to install Sidra. #128438
- Solved an issue with build pipelines for CLI tool due to pool image not being specified in the template. #130822
- Solved a type size issue in a JSON column that was causing a stored procedure to fail in the DW. #133670
- Solved an issue with installation size L not being correctly configured in Sidra CLI. #128817
- Fixed an issue with the connectors' wizard in Sidra web where "number of tables per batch" input was not being validated correctly. #119267
- Fixed an issue in rolling back connectors plugin infrastructure when using an existing trigger. #125778
- Fixed an issue that prevented a deployment from being updated when needed. #125523
- Solved an issue with IS registration when user email has special characters. #125993
- Fixed an issue in the Sidra CLI tool, where the "deploy source" command was logging error traces incorrectly. #127063
- Solved issues with SignalR in log files due to non-serializable notifications. #128563
- Fixed an issue causing the `DeployDataFactoryPipelines` step to fail while updating Sidra after recreating AAD Applications. #128910
- Solved an issue with password recovery email styling. #125841
- Fixed an issue when deploying a Data Product due to missing parameter environmentDescription. #129149
- Fixed an issue generating password for databases. #129500
- Solved syntax error issue in incremental load pipelines for Azure SQL and SQL Server. #129584
- Fixed an issue in Azure Search preventing Assets from being created after all the Asset batches. #124545
- Solved some errors when deploying Sidra on VNET. #129969
- Solved an issue with duplicated records in staging tables with same IdAsset. #136592
Breaking Changes in Sidra 2022.R1¶
Change tiers in installation sizes¶
This is not a breaking change per se, but rather a change to be considered from a cost perspective. For installation size S, the default App Service plan is now S2 instead of S1. The rest of the resources stay the same.
The Elastic Pool has also been increased in all sizes, as per what is included in the table below.
No action is required. These are the recommended sizes for the correct functioning of a Sidra installation of each size. However, if changes are necessary, the size of any resource can be changed using the CLI optional parameters.
| Parameter | S | M | L | XL |
| --- | --- | --- | --- | --- |
| ElasticPoolDtu | 100 | 200 | 400 | 800 |
| ElasticPoolDatabaseDtuMin | 0 | 0 | 0 | 0 |
| ElasticPoolDatabaseDtuMax | 50 | 100 | 200 | 400 |
| ElasticPoolEdition | Standard | Standard | Standard | Standard |
Custom naming convention for the Databricks tables and the staging tables in Data Products¶
The API now contains parameters to enable a custom table name prefix for the Databricks tables and the staging tables. This setting is disabled by default. When enabled, you can specify your own table name prefix, or a placeholder, to name the tables.
Regarding the need to update or add new metadata, there will be several configuration options. Two different naming convention modes are possible, depending on whether a fixed value or a placeholder is used as the prefix in the table names:
1. If the setting `EnableCustomTableNamePrefix` is set to `false`, independently of the value set for the parameter `CustomTableNamePrefix`, the names will follow the new default naming convention provided by Sidra: "databasename_schemaname_tablename".
2. If the setting `EnableCustomTableNamePrefix` is set to `true`, then you can use either a fixed value as the prefix for the naming convention, or a placeholder:
    2.1. If using a fixed value to concatenate a fixed prefix to the actual table name, populate the field `CustomTableNamePrefix` with a string value. The configured prefix will be concatenated to the actual name of the table.
    2.2. If using a placeholder value to concatenate a dynamic prefix to the actual table name, populate the field `CustomTableNamePrefix` with a placeholder string value.
You can see more details and examples of use in How naming conventions for Data Product staging tables work.
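The naming modes described in this section can be sketched as a small resolution function. This is a minimal illustration, not the real API's logic, and the `#database#`/`#schema#` placeholder syntax is a hypothetical example of what a placeholder prefix could look like:

```python
def staging_table_name(database, schema, table,
                       enable_custom_prefix=False,
                       custom_prefix=""):
    """Resolve a staging table name under the 2022.R1 naming rules
    (sketch only; parameter handling in the real API may differ)."""
    if not enable_custom_prefix:
        # Default convention: databasename_schemaname_tablename.
        return f"{database}_{schema}_{table}"
    if not custom_prefix:
        # Enabled with an empty prefix: just the source table name.
        return table
    # Fixed or placeholder prefix, concatenated to the table name.
    # The placeholder tokens below are hypothetical.
    prefix = (custom_prefix
              .replace("#database#", database)
              .replace("#schema#", schema))
    return f"{prefix}{table}"
```

For instance, with the setting disabled a source table `orders` in database `sales`, schema `dbo`, resolves to `sales_dbo_orders`; with the setting enabled and a fixed prefix `stg_`, it resolves to `stg_orders`.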
Data Intake Process migrations¶
For all configured pipelines created before the release of the Data Intake Process (version 1.11.x - 2022.R1), there is an automated migration process, to be applied by the support team on every installation environment together with the Sidra update process.
This migration process creates the underlying Data Intake Process objects in the Sidra Core metadata database, even if the configured intake pipelines were not created via a connector plugin (e.g. custom pipelines), or were created by a connector plugin that does not support this Data Intake Process wizard yet.
No action is required on user's side:
- The described migration script will be planned and executed by support when updating to this release version.
- After this migration, users can expect to see the configured data intakes of the environment in the Data Intake Processes list.
- This will not in any way affect the normal functioning of the underlying data extraction pipelines, which will continue unchanged.
- Users are also not expected to perform any changes to their current working pipelines.
Therefore, the migration to create these Data Intake Processes will only be available for the list of supported connector plugins in Sidra.
For more details, please check on the Data Intake Process documentation.
This 2022.R1 release represents an important step towards making the existing pipelines and data ingestion robust enough to handle new scenarios.
Important steps have been taken to set the foundations for the plugin lifecycle model with the incorporation of the Data Intake Process concept, the operational improvements in managing plugins and the usage of the Entity Pipeline association also for plugins.
A new category of data ingestion from complex Excel files enables today a flexible and automated way to extract metadata structures from complex Excel worksheets.
Data querying from the lake has been completed with advanced scenarios of data querying and relationships between Entities.
Additionally, new plugins like the SharePoint connector plugin have been released, using the previously released plugin model. This completes a full end-to-end binary file ingestion user flow, just by configuring a few parameters from Sidra Web.
As part of the next release, we are planning to extend the plugin catalogue for creating and editing Data Intake Processes from Sidra Web, with the release of new connector and Data Product plugins. New features for schema evolution and for improving binary file ingestion will also be incorporated.
We would love to hear from you! You can make a product suggestion, report an issue, ask questions, find answers and propose new features at our Sidra ideas portal, or by reaching out to us at [email protected].