Sidra Data Platform (version 2022.R1: Joyous Junaluska)¶

released on Mar 01, 2022

During this latest release cycle, we have continued to put a big focus on the data intake plugins' model and key building blocks to allow for a fully agile release cycle of plugins, decoupled as much as possible from Sidra release cycles.

This plugin model has been improved with the incorporation of the concept of Data Intake Process, described below.

This release comes with many improvements to the Sidra Management Web UI, such as the addition of tags to Attributes and the ability to assign permissions to Sidra roles from the web.

On top of that, deployment and operability improvements have been made, such as a reworked integration of logs and notifications with Application Insights, support for feature toggling, and CLI enhancements.

The key features around plugins continue paving the way towards a full lifecycle for the configuration of processes related to data ingestion.

Sidra 2022.R1 Release Overview¶

This release comes with very significant features and improvements, some of which represent a big architectural change that is the cornerstone of the new Sidra:

Data Intake Process model with plugins
SharePoint online library connector plugin
New Databricks Data Product template
Support for Excel data intake
Data Catalog and web improvements
Support for multiple pipelines in Azure Search
New Data Product Sync behavior
Support for custom naming conventions in staging tables in Data Products
Updated Databricks to latest 10.3 runtime version
Logs and Notifications in AppInsights
Improvements in plugin type mappings
CLI improvements
Esquio integration for feature toggling
New plugin type for pipeline templates installation
Documentation improvements

What's new in Sidra 2022.R1¶

Data Intake Process model with plugins¶

In the course of this latest release, we have continued to improve the data intake plugins model and to lay the cornerstone for a full plugin management life-cycle for our users. Once this full life-cycle is completed, it will mainly allow users:

To perform actions of re-configuration of data intake processes.
To upgrade the pipelines to new versions in an automated and controlled manner.

It is also worth highlighting a new concept, the Data Intake Process, which bridges the two worlds of plugin execution and underlying data intake infrastructure creation and operations.

Users are now able to access a new section of the web called Data Intake Processes to see the list of configured Data Intake Processes in any given Sidra installation.

In a future release, from this list, users will be able to invoke actions such as update (the ability to re-execute the plugin with new configuration parameters), or upgrade to a new version of the plugin.

That brings us to the Connector Plugin concept. Every connector plugin execution will create a Data Intake Process object in Sidra which, in turn, will create the corresponding data integration infrastructure elements, the data governance structures and the metadata. The connector plugins available in this latest release of Sidra are the Azure SQL connector, the SQL Server connector and the SharePoint online library connector.

A new automated migration process has been developed to convert the existing pipelines to the new Data Intake Model. This will be executed by our support team as part of the Sidra update process in this release 2022.R1.

Please check the Breaking Changes section for more information about the Data Intake Process migrations.

SharePoint online library connector plugin¶

A new connector plugin for extracting and ingesting data from SharePoint online library data sources is released as part of this Sidra version.

This new connector plugin comes with several configuration options which cover different scenarios depending on the destination container and its relative file types.

New Databricks Data Product template¶

This release incorporates a new Databricks Data Product template as an additional extension to the regular Basic SQL Data Product.

This template will cover different advanced scenarios where the user may require custom data query logic on the raw data lake or advanced data aggregation and dependencies logic.

Support for Excel data intake¶

A new process for data intake from Excel files has been implemented in Sidra to cover the multiple scenarios in which users need to consume data stored in Excel. Thanks to this process, it is now possible:

To automate the configuration of data tables in complex Excel files.
To accelerate through automation the definition of the Sidra metadata for these tables (Entities).
To automate the data ingestion to the Data Lake with a minimal setup.

Thus, our Excel ingestion process is capable of inferring the schema of complex-formatted Excel files in order to define how the Entities and Views are going to be created, as well as how the data is going to be extracted.

Data Catalog and web improvements¶

Sidra Web UI incorporates in this release multiple functionality upgrades, redesigns and performance improvements.

Below is a list with all the improvements included as part of this feature:

Assign permissions to roles

Users can now edit the set of permissions included in each role, as defined in each application in Sidra. With this, the full Balea Authorization framework is now implemented from the web.

Notifications' redesign

The notifications page has been redesigned to improve performance when displaying notifications and marking them as read. The page now limits the number of notifications displayed, to provide a clearer view and improve performance.

Add tags to Attributes

With this release it is now possible to add tags to the Attributes in the Data Catalog. This completes the ability to codify / categorize the Assets in Sidra Data Catalog at all three levels of data ingestion Assets: Providers, Entities and Attributes.

New Entities list

The Data Catalog view for Entity detail has been redesigned to improve the navigation through Entities and convey the most important information in an optimized way.

Support for multiple pipelines in Azure Search¶

Before this release, it was only possible to execute one pipeline per defined Entity in the binary file ingestion flow. File indexing, however, may require processing a set of files in parallel, so that they are indexed using different skillsets or indexing modules.

A data ingestion pipeline for Azure Search or binary file ingestion basically defines a skillset or indexing workflow to be executed over the files belonging to an Entity.

To support multiple simultaneous indexing workflows for the same Entity, the binary file ingestion module has been adapted to support a one-to-many relationship between Entities and data indexing pipelines. DSU tables and intermediate artifacts in Sidra, like the Knowledge Store, have also been adapted to support this multiple-pipelines scenario.

This support for multiple pipelines only applies to binary file ingestion.

For other types of data intake, like ingestion from structured or semi-structured data sources, the one-to-one relationship between Entity and data ingestion pipeline remains.
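The cardinality rule above can be sketched in a few lines of Python. This is only an illustration: the `Entity` class, its fields and the validation function are hypothetical names, not part of the Sidra API, and the requirement that a binary Entity has at least one pipeline is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical, simplified stand-in for a Sidra Entity."""
    name: str
    is_binary: bool  # True for binary file ingestion (Azure Search indexing)
    pipeline_ids: list[str] = field(default_factory=list)


def validate_pipeline_association(entity: Entity) -> None:
    """Raise ValueError if the Entity breaks the cardinality rule described above."""
    count = len(entity.pipeline_ids)
    if entity.is_binary:
        # Binary file ingestion: one Entity may have many indexing pipelines.
        # Requiring at least one is an assumption of this sketch.
        if count < 1:
            raise ValueError(f"{entity.name}: expected at least one pipeline")
    elif count != 1:
        # Structured / semi-structured intake: the relationship stays one-to-one.
        raise ValueError(f"{entity.name}: expected exactly one pipeline, got {count}")
```

For example, an Entity created as `Entity("Contracts", True, ["ocr", "key-phrases"])` passes validation, while a structured Entity with two pipelines raises `ValueError`.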

New Data Product Sync behavior¶

A new Sync behavior has been incorporated through a Sync webjob, which is responsible for:

The metadata synchronization between the DSU and the Data Products.
Triggering the actual client data orchestration pipelines.

This new behavior is more robust against possible errors during the Data Intake Process, where some of the Entities are loaded while others are not. With this release of Sidra, when this scenario happens, capturing and processing on the Data Product side is guaranteed to include all intermediate successful Assets in the Data Product synchronization. This applies to all Assets that were due to be ingested in the previous scheduled loads, up to the most recent day for which there are available Assets for every mandatory Entity.
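The cut-off rule described above can be illustrated with a minimal Python sketch. It is not the Sync webjob implementation: the function names and the input shape (a mapping from Entity name to the set of days with a successfully loaded Asset) are assumptions made for the example.

```python
from __future__ import annotations

from datetime import date


def sync_cutoff(assets_by_entity: dict[str, set[date]], mandatory: set[str]) -> date | None:
    """Return the most recent day on which every mandatory Entity has an Asset."""
    day_sets = [assets_by_entity.get(entity, set()) for entity in mandatory]
    if not day_sets:
        return None
    common_days = set.intersection(*day_sets)
    return max(common_days) if common_days else None


def assets_to_sync(assets_by_entity: dict[str, set[date]], mandatory: set[str]) -> dict[str, list[date]]:
    """Select every Asset day up to the cut-off, including intermediate days."""
    cutoff = sync_cutoff(assets_by_entity, mandatory)
    if cutoff is None:
        return {}
    return {
        entity: sorted(day for day in days if day <= cutoff)
        for entity, days in assets_by_entity.items()
    }
```

In this sketch, if Entity A has Assets for three consecutive days and Entity B only for the first two, the cut-off is the second day, and the third-day Asset of A waits for a later synchronization.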

Support for custom naming conventions in staging tables in Data Products¶

The API now contains parameters to enable a custom table name prefix in the Databricks tables and the staging tables. This setting is disabled by default. If it is enabled, you can specify your own table name prefix; if the table prefix is empty, the resulting name will be just the name of the table at source.

The settings are EnableCustomTableNamePrefix and CustomTableNamePrefix, and they need to be passed as parameters when calling the metadata extraction pipeline.

Please also read the Breaking Changes section for more information on the different implementation alternatives.

Updated Databricks to latest 10.3 runtime version¶

We have updated to the latest Databricks 10.3 runtime version, where Apache Spark is upgraded to version 3.2.1.

Logs and Notifications in AppInsights¶

Sidra includes a broad range of possibilities to monitor and track the operational processes, especially during data intake: logs, notifications, KPIs in Sidra web and operational Power BI dashboard. In this new version, the product leverages the huge, centralized monitoring capabilities of Azure Application Insights to build upon the operability of the platform. With this release, all logs and notifications are sent to Application Insights, on top of the Azure Data Factory metrics. This enables the following:

Full access to metrics.
Dashboard and alerts building through workbooks.
Log Analytics queries.

Sidra now allows custom queries to be built using and combining these data sources to spot specific metrics relevant to users, at business or operational level.

Improvements in plugin type mappings¶

In this release, we continue building upon the existing plugins and data intake pipeline templates, as well as improving and adding new type translations between sources and Sidra components. More specifically, Oracle and SQL Server type mappings have been thoroughly reviewed and extended to cover more scenarios.

CLI improvements¶

As part of our ongoing effort on robustness and usability of the CLI tool, in this release we have included an enhancement whereby:

The CLI tool now checks that the Git, AzureCLI and Dotnet requirements are installed before continuing. This contributes to a cleaner installation and reduces failed attempts due to prerequisites not being in place.
Additionally, more descriptive error messages have been included in different commands to guide through the process and reduce troubleshooting time in case some pre-requirements or permissions are not in place.
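As an illustration of this kind of pre-flight check (not the Sidra CLI's actual implementation), the following Python sketch looks for the required executables on the PATH. The executable names `git`, `az` and `dotnet` are assumptions, and the function name is hypothetical.

```python
import shutil

# Display name -> executable name. The executable names are assumptions.
REQUIRED_TOOLS = {"Git": "git", "Azure CLI": "az", ".NET": "dotnet"}


def missing_prerequisites(tools=None):
    """Return the display names of tools whose executable is not found on PATH."""
    if tools is None:
        tools = REQUIRED_TOOLS
    return [name for name, exe in tools.items() if shutil.which(exe) is None]
```

A caller could stop early with a descriptive message, for example by raising `SystemExit("Missing prerequisites: " + ", ".join(missing))` when the returned list is not empty.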

Feature toggling using Esquio¶

Sidra now uses Esquio to enable feature toggling scenarios. Esquio is a feature server integrated with the Sidra backoffice that allows defining features and applying different release / visibility rules to them based on a broad number of criteria. Further evolutions of Esquio will be explored to support complex access logic in Sidra Web UI and Sidra API.

New plugin type for pipeline templates installation¶

Sidra architecture is continuously evolving to scale up and out according to growing roadmap requirements and customer needs. Modularity is a key principle of Sidra architecture, and the plugin model is one of the most important examples. In order to allow for a more decoupled evolution of data intake and the rest of the platform components, we have re-used the plugin model for a new use case. A new type of plugin, called pipelinedeployment, has been added to support versioning and easy installation of data intake pipelines even if they are not implemented by a full plugin on the web. This model is useful for versioning and installing bespoke data intake pipelines, decoupled as much as possible from the Sidra release life-cycle.

Documentation improvements¶

To keep improving our transparency and support for users, the Sidra documentation website includes several new guides in this release:

New CLI documentation pages.
New Data Product pages including Data Product deployment.
Added tutorial for data intake from Excel files.
Added tutorial for Schema evolution and data intake from CSV files.

Issues fixed in Sidra 2022.R1¶

Sidra's team is constantly working on improving the product stability, and the following list includes the most relevant bugs fixed and improvements developed as part of this new release:

Fixed an issue when retrieving runId from Databricks, whereby a change in the runId numeration was triggering an error in the Data Product pipelines. #137836
Fixed an issue where the dataset template for a specific activity was passing a wrong reference of the dataset template, therefore the default values were incorrect. #137369
Solved an issue in pipeline templates by removing the index for ItemId, adding a composite index, and forcing the dataset(source)templatepipelinetemplate to be updated only when the ItemId, IdDataset(source) and PipelineId exist. #137446
Fixed 'time' translation cast to string in parquet files in Azure SQL connector plugin. #134759
Solved an issue when synchronizing data in some Data Product templates when there are no assets in Core DB. #137402
Fixed an issue in Transfer Query, which failed for fail ingestion when including replaceWhere by IdSourceItem to create delta inserts. #137193
Fixed an issue in CLI tool when executing New-AzADAppCredential for Automation Account. #134290
Fixed an issue in connector plugins that made notifications fail under specific circumstances. #134944
Solved an issue in Persistence Client seed project in Data Products, which did not create the ADF dataset automatically. #134955
Fixed a problem in connector plugins where the validation of existing Provider name was not informing the user correctly. #133847
Fixed an issue in transfer query script that was preventing the data preview tables from being generated due to missing import packages. #134385
Solved an issue with DATE and too-high DECIMAL precision type translations in Oracle data sources. #134529
Amended an issue in Sidra CLI that was not setting up properly the Redirect URIs and Identifier URI when creating AAD applications. #135136
Fixed an issue in Identity Server application that was preventing Sidra plugins from being installed correctly. #135176
Added a final status setup for an Asset that does not contain data. #131351
Fixed some incomplete data configuration persistence while deploying Sidra. #133095
Solved an issue where installation verification of Azure CLI stopped execution if there was a new available Azure CLI version. #134623
Corrected some type translation errors when attempting type translation for source types with Oracle databases. #131753
Fixed a cosmetic issue in alignment of elements in the plugins' gallery. #135716
Solved a problem related to internal users in IS that was causing login and password recovery issues for internal user and password settings. #165506
Fixed an issue in Basic SQL Data Product where the template deployment was not done correctly in dotnet 6. #133201
Corrected an issue with big images in the Data Product detail management page in Sidra web. #131981
Fixed an issue in intake pipelines with incremental loads configured without using change tracking. #135551
Removed unused parameters on Data Product template that were throwing errors at deployment. #127096
Fixed an issue related to compatibility with SQLServer versions lower than 2016. #133559
Solved an issue where DataFactoryManager was deleting resources in the wrong order in Data Products. #131863
Fixed an issue where the GetEntity detail method was not returning the correct imageURL for the Entity. #134300
Amended an issue where GetEntitiesToExport activity was not using the custom Staging table name defined in EntityStagingMapping table. #134358
Corrected an issue where GetStagingTableStatements in the CopyEntityToStaging Data Product pipeline was not using the appropriate dataset. #134372
Fixed an issue where Data Products registration failed when using implicit client authentication. #134236
Fixed an error with stored procedure GrantClientIentityAccess under specific circumstances. #133950
Fixed an issue with metadata extraction pipeline compatibility with SQL Server 2014. #132969
Solved a problem where BatchExtractAndIntakeSQLPictures pipeline was using old database table names. #133172
Fixed an issue where the BatchExtractAndIntakeSQLMultiDBCompatibilityLevel100 dataset used for the activity was hard-coded; it now takes the value from the extractSqlServerDataToBlobInputDataset parameter. #134522
Solved a problem where BatchExtractAndIntakeSQLMultiDBCompatibilityLevel100 pipeline was not pointing to the proper activity output to take the CurrentChangeTrackingValue value. #134530
Fixed an issue where long tags for Entities were failing to be added to the Data Catalog in Sidra Web. #136059
Fixed an issue in DSU deployment stage when failing to create some resources under some circumstances. #134333
Solved a problem where granting permissions to AAD applications was failing because a newly created app is "locked" and permissions cannot be granted to it immediately; retries have been added. #129639
Fixed an issue where EntityDTO was not being correctly validated so table model and validation rule were mismatching. #131559
Amended an issue where the Logs cleanup job was failing to execute. #130621
Fixed an issue where the stored procedure to create the staging tables in the Data Product was using the Provider name instead of the database name. #132460
Solved an issue in Data Products queries to the DSU when column names with spaces and other chars were not correctly sanitized. #132556
Added a fix to allow customizing database parameters in a Data Product based on SQL. #132519
Solved an issue where the functionality to mark notifications as read from Sidra Web was not working correctly for a very big number of notifications. #132870
Fixed an issue where the call [POST] /api/metadata/providers/raw was not filling all the required values. #133168
Solved an issue when granting permissions to Data Products due to a length cause. #133149
Fixed an issue in an intake pipeline template that prevented Assets from being created when there was no data from source. #131662
Solved an issue when deploying on a VNet related to FormRecognizer deployment. #130216
Fixed an issue with NUMERIC type translations in SQL plugins. #130502
Amended an issue where the KPI cards for the dashboard in Sidra web were not displaying correctly when more than one DSU was installed. #121480
Fixed an issue in User permissions section in Sidra web due to parent security path not being mandatory. #130863
Fixed an issue with incorrect type translation with DB2 with timestamp values. #130820
Added a fix in transfer queries script to prevent failure caused by parallelization in initial stages of data load. #124109
Solved performance issues when synchronizing metadata in Data Products. #128545
Fixed a performance issue in the notifications page in Sidra web when there was a big number of notifications to be displayed. #131604
Added Az.Resources or Az.Automation as part of the prerequisites to install Sidra. #128438
Solved an issue with build pipelines for CLI tool due to pool image not being specified in the template. #130822
Solved a type size issue in JSON column in DW that was causing the stored procedure to fail in the DW. #133670
Solved an issue with installation size L not being correctly configured in Sidra CLI. #128817
Fixed an issue with the connectors' wizard in Sidra web where "number of tables per batch" input was not being validated correctly. #119267
Fixed an issue in rolling back connectors plugin infrastructure when using an existing trigger. #125778
Fixed an issue that prevented a deployment from being updated when needed. #125523
Solved an issue with IS registration when user email has special characters. #125993
Fixed an issue in Sidra CLI tool, where the "deploy source" command was logging error traces incorrectly. #127063
Solved issues with SignalR in log files due to non-serializable notifications. #128563
Prevented the DeployDataFactoryPipelines step from failing while updating Sidra after recreating AAD Applications. #128910
Solved an issue with password recovery email styling. #125841
Fixed an issue when deploying a Data Product due to missing parameter environmentDescription. #129149
Fixed an issue generating password for databases. #129500
Solved syntax error issue in incremental load pipelines for Azure SQL and SQL Server. #129584
Fixed an issue in AzureSearch preventing Assets from being created after all the Assets batches. #124545
Solved some errors when deploying Sidra on VNet. #129969
Solved an issue with duplicated records in staging tables with same IdAsset. #136592

Breaking Changes in Sidra 2022.R1¶

Change tiers in installation sizes¶

Description¶

This is not a breaking change per se, but rather a change that needs to be considered from a cost perspective. The default size for the S App Service plan is now S2 instead of S1. The rest of the resources stay the same.

The Elastic Pool has also been increased in all sizes, as shown in the table below.

Required Action¶

No action is required. These are the recommended sizes to ensure the correct functioning of a Sidra installation of each size. However, if a change is necessary, the sizes of any resource group can be changed using the optional CLI parameters.

Parameter	S	M	L	XL
ElasticPoolDtu	100	200	400	800
ElasticPoolDatabaseDtuMin	0	0	0	0
ElasticPoolDatabaseDtuMax	50	100	200	400
ElasticPoolEdition	Standard	Standard	Standard	Standard

Custom naming convention for the Databricks tables and the staging tables in Data Products¶

Description¶

The API now contains parameters to enable a custom table name prefix in the Databricks tables and the staging tables. This setting is disabled by default. When enabled, you can specify your own table name prefix, or a placeholder, to name the tables.

Required Action¶

Regarding the need to update or add new metadata, several configuration options are available. Also, two different naming convention modes are possible, depending on whether a fixed value or a placeholder is used as the prefix in the names of the tables:

1. If the setting EnableCustomTableNamePrefix is set to false, independently of the value set in the parameter CustomTableNamePrefix, the names will follow the new default naming convention provided by Sidra: "databasename_schemaname_tablename".

2. If the setting EnableCustomTableNamePrefix is set to true, then you can use either a fixed value as the prefix for the naming convention, or a placeholder:

2.1. If using a fixed value to concatenate fixed prefixes to the actual table name, then you need to populate the field CustomTableNamePrefix with a string value. This will concatenate the prefix configured in customTableNamePrefix to the actual name of the table.

2.2. If using a placeholder value to concatenate a dynamic prefix to the actual table name, then you need to populate the field CustomTableNamePrefix with a placeholder string value.
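The naming rules above can be summarized in a short Python sketch. It is an illustration under stated assumptions, not Sidra's implementation: the function name and its arguments are hypothetical, the prefix is joined to the table name by plain concatenation with no separator, and placeholder expansion is left out because these notes do not describe the placeholder syntax. The empty-prefix case follows the statement earlier in these notes that an empty prefix yields just the name of the table at source.

```python
def resolve_table_name(database, schema, table,
                       enable_custom_prefix=False, custom_prefix=""):
    """Return the table name for the given settings (fixed prefixes only)."""
    if not enable_custom_prefix:
        # Default convention; CustomTableNamePrefix is ignored in this mode.
        return f"{database}_{schema}_{table}"
    # Fixed-prefix mode: plain concatenation (an assumption of this sketch).
    # An empty prefix leaves just the name of the table at source.
    return f"{custom_prefix}{table}"
```

For instance, `resolve_table_name("sales", "dbo", "Orders")` gives `sales_dbo_Orders`, while enabling the setting with the prefix `stg_` gives `stg_Orders`.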

Data Intake Process migrations¶

Description¶

For all configured pipelines created before the release of the Data Intake Process (version 1.11.x - 2022.R1), there is an automated migration process to be applied by the support team on every installation environment together with the Sidra update process.

This migration process creates the underlying Data Intake Process objects in Sidra Core metadata database, even if the configured intake pipelines were not created via a connector plugin (e.g. custom pipelines), or were created by a connector plugin that is not supporting this Data Intake Process wizard yet.

Required Action¶

No action is required on the user's side:

The described migration script will be planned and executed by support when updating to this release version.

After this migration, users can expect to see the data intakes configured in the environment in the Data Intake Process list.

This will not affect in any way the normal functioning of the underlying data extraction pipelines, which will continue with no changes.

Users are also not expected to perform any changes to their current working pipelines.

The migration to create these Data Intake Processes will therefore only be available for the list of supported connector plugins in Sidra.

For more details, please check the Data Intake Process documentation.

Coming soon...¶

This 2022.R1 release represents an important step towards making the existing pipelines and data ingestion robust enough to handle new scenarios.

Important steps have been taken to set the foundations for the plugin lifecycle model with the incorporation of the Data Intake Process concept, the operational improvements in managing plugins and the usage of the Entity Pipeline association also for plugins.

A new category of data ingestion from complex Excel files enables today a flexible and automated way to extract metadata structures from complex Excel worksheets.

Data querying from the lake has been completed with advanced scenarios of data querying and relationships between Entities.

Additionally, new plugins like the SharePoint connector plugin have been released, using the previously released plugin model. This completes a full end to end binary file ingestion user flow, just by configuring a few parameters from Sidra Web.

As part of the next release, we are planning to increase the plugin catalogue to create and edit Data Intake Processes from Sidra Web, with the release of new plugins for connector and Data Products. New features for schema evolution and for improving binary file ingestion will also be incorporated.

Feedback¶

We would love to hear from you! You can make a product suggestion, report an issue, ask questions, find answers and propose new features by reaching out to us at [email protected].
