Sidra Data Platform (version 2022.R2: Kind Kanzi)¶

released on May 11, 2022

The main focus of this release has been the full migration of Sidra from .NET Core 3.1 to .NET 6.

However, this release also includes important functional improvements. The most significant of them is the support for schema evolution in database source systems. While previous versions of Sidra already detected changes to the data source table schemas and notified the user of these changes, this release brings support for fully automated schema amendment on the DSU tables, adding new Attributes and Entities when they appear in the data source.

This release also includes Knowledge Store ingestion performance and operational improvements that cover different scenarios derived from binary file ingestion, making the Knowledge Store a fully supported scenario.

Besides all these changes, some other technical evolution topics have been included in this release, including the update of Databricks cluster runtime to the latest long term support (LTS) version 10.4, and the migration of all the CI/CD Azure DevOps release pipelines to support the current windows-2022 agent.

Sidra also continues to consolidate the operational improvements around Data Products deployment and plugin management.

Sidra 2022.R2 Release Overview¶

This new release was centered around the migration to .NET 6 and Databricks Runtime 10.4, both of which are LTS releases that ensure the long term supportability of the platform. In addition to that, we have managed to release the following new features:

.NET migration
Windows-2022 release pipelines
Schema evolution for added columns and tables in source
Updated Databricks cluster configuration to 10.4 LTS
Data Product improvements
Support multiple Data Product pipelines per Entity
Inclusion of Entity information in Data Products
Plugin release management improvements
File indexing artifacts deletion and re-indexing automated process
Allow SMTP settings different from Sendgrid in Sidra installation
Knowledge store ingestion performance improvements
Documentation improvements

What's new in Sidra 2022.R2¶

.NET migration¶

One of the biggest changes in this release has been the full migration from .NET Core 3.1 to .NET 6. This was a critical and much needed effort, since Sidra was using .NET Core 3.1, which was due to reach end of life (EOL) by the end of 2022. By moving to .NET 6 we are not just reaping the benefits of the newer platform: since this new version is LTS, we are also ensuring supportability until November 2024. This migration work has impacted all Core, DSU and Data Products modules, including:

Sidra plugins
Sidra API
Sidra Data Products
Identity Server
Balea authorization framework
Other Sidra backend modules, e.g. backend jobs, backoffice and deployment tools and modules

Windows-2022 release pipelines¶

The Agent Pool in Sidra Azure DevOps Release pipeline has been upgraded to windows-2022, affecting both Core deployment and Data Products pipelines. You can check more information about this technology evolution in this link.

Schema evolution for added columns and tables in source¶

In this release, Sidra incorporates a new feature that automatically evolves the schema of the source tables: whenever new columns are added in the source, a new Attribute is created in the Sidra Core metadata for each new column.

Following the Sidra process, new columns will be created in the respective Databricks tables for that Entity. Then, on each execution of the data extraction pipeline, Assets will be created including these new columns in Databricks. Also, whenever new tables are added in the source system, if these tables are configured to be included in the data extraction (in the metadata extraction options of the Data Intake Process configuration), new Entities will be created for these added tables and associated with the data extraction pipeline, so that they can be included in the data extraction set.

For more details, you can see the related schema evolution documentation.
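
As an illustration only, the behavior described above can be sketched as a comparison between the source schema and the existing metadata. The Entity class, the evolve_schema function and the included_tables argument below are hypothetical names used for this sketch; they are not part of the Sidra API, and the actual implementation may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical stand-in for a Sidra Entity and its Attributes."""
    name: str
    attributes: list[str] = field(default_factory=list)


def evolve_schema(entities: dict[str, Entity],
                  source_tables: dict[str, list[str]],
                  included_tables: set[str]) -> list[str]:
    """Add Attributes for new columns and Entities for new tables.

    Returns a log of the changes applied, in a deterministic order.
    """
    changes = []
    for table in sorted(source_tables):
        if table not in entities:
            # Assumption: a new table only becomes an Entity when the
            # data extraction is configured to include it.
            if table not in included_tables:
                continue
            entities[table] = Entity(name=table)
            changes.append(f"entity:{table}")
        entity = entities[table]
        for column in source_tables[table]:
            if column not in entity.attributes:
                entity.attributes.append(column)
                changes.append(f"attribute:{table}.{column}")
    return changes
```

Note that the sketch only handles additions; as in this release, deleted columns and tables are left untouched.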

Updated Databricks cluster configuration to 10.4 LTS¶

The runtime version of the Databricks cluster for DSU resource groups has been upgraded to 10.4, the latest long term support (LTS) version that Databricks has released. You can check this information in the Azure Databricks documentation.

Support multiple Data Product pipelines per Entity¶

A more robust mechanism to manage multiple Data Product pipelines per Entity has been included.

This has been done through a couple of improvements on the Data Products side:

Sidra now supports a new Sync behavior, PipelineSyncBehavior (Sync is the webjob that synchronizes metadata between the DSU and the Data Product). This behavior loads any Asset up to the most recent day in which there are available Assets. The Asset status is tracked via the table ExtractPipelineExecution, which makes it possible to track the status of multiple pipelines per Entity. PipelineSyncBehavior is the mandatory Sync behavior for loading an Entity with several pipelines. It is also the recommended Sync behavior for all new load pipelines.
All the Data Product pipeline templates have been updated to keep track of and update this Client pipeline execution status. You can check this information in the Sidra Data Product concepts section and the Sidra Data Product pipelines documentation pages.

Data Product improvements¶

This feature consists of a few enhancements on the Data Product side:

A condition has been added to the orchestrator activity in the Data Product load pipelines, to handle the case in which the stored procedure is empty or a new template is created without the orchestrator.
All Sidra-owned Data Product pipeline templates have been reviewed from a security standpoint, to ensure that all sensitive settings are handled by the KeyVault resource inside the Data Product.
A new consolidation mode, overwrite, has been added for overwriting data in Data Products. To enable this mode, a new Data Product pipeline parameter called PipelineExecutionProperties has been added. In the default consolidation mode, merge, the data is merged if there is a Primary Key; otherwise, the data is appended. If the consolidation mode is overwrite, the entire table is always overwritten.

More information about this feature can be checked in the documentation site.
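
The difference between the two consolidation modes can be sketched as follows. This is a simplified model working on in-memory rows, not Sidra code: the consolidate function and its arguments are hypothetical, and treating a merge as an update-or-insert keyed on the Primary Key is an assumption.

```python
def consolidate(table, incoming, mode="merge", primary_key=None):
    """Return the new table contents after consolidating `incoming` rows.

    Rows are plain dicts; the `table` argument is not modified.
    """
    if mode == "overwrite":
        # The entire table is always replaced by the incoming data.
        return list(incoming)
    if mode != "merge":
        raise ValueError(f"unknown consolidation mode: {mode}")
    if primary_key is None:
        # No Primary Key: the incoming rows are simply appended.
        return list(table) + list(incoming)
    # Primary Key present (assumption: update matching rows, insert the rest).
    merged = {row[primary_key]: row for row in table}
    for row in incoming:
        merged[row[primary_key]] = row
    return list(merged.values())
```

With a Primary Key, loading the same data twice leaves the table unchanged, whereas without one each load adds the rows again; overwrite makes every load start from an empty table.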

Inclusion of Entity information in Data Products¶

In this version of Sidra, information from the Asset table in Sidra Core, such as row count (Entities), errors (Validation Errors) and byte size (ByteSize), has been included in Data Products, in order to keep detailed track of the whole process from the Data Product side. The addition of these fields allows advanced logic to be configured for use cases on the Data Product side.

Plugin release management improvements¶

We have refactored the plugins code to move each plugin to its own release branch, independent of the Sidra release and of the releases of other plugins. Additionally, we have added a new registration endpoint for Sidra plugins, to ease the compatibility setup between each new plugin version and the Sidra version.

File indexing artifacts deletion and re-indexing automated process¶

This feature complements the current capability of Knowledge Store ingestion in Sidra, to support scenarios where we need a semi-automated process to remove artifacts from the index, or to prepare the environment for a re-indexing of the documents, preventing previously ingested documents from affecting the index. This scenario could happen whenever there is a change in the index structure, an update of the skill model, etc.

A new Core API endpoint has been developed that orchestrates the different actions required to prepare all the moving pieces of Sidra binary file ingestion to remove old documents from the documents index, as well as intermediate and final data ingestion artifacts (raw storage, index, Knowledge Store, Databricks tables).

This endpoint supports two basic modes. In both modes, the intermediate structures and Asset state for the given Assets ingested by a specific pipeline are removed from Core. The artifacts that are always removed are the following:

Indexes
Indexers
Azure Search datasources
Knowledge Store files
Databricks tables and ingested data

With regards to what to do with the raw documents in the intermediate raw storage container, two different modes have been defined for this endpoint:

Mode Delete: Moves all Asset binary files from all Entities associated with the Azure Search pipeline to the backup container, and sets the Asset indexing state to Archived.
Mode PrepareToReindex: Moves all Asset binary files from the raw container of each of the Entities associated with the pipeline to the /indexlanding container. In this case, the next execution of the indexer job will re-register these Assets and reindex them in batch.
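
To summarize how the two modes differ, the following sketch builds a description of the actions for each mode without performing them. The plan_cleanup function and the structure it returns are hypothetical and only illustrate the text above; they do not reflect the signature of the Core API endpoint.

```python
# Artifacts removed in both modes, as listed above.
ALWAYS_REMOVED = [
    "Indexes",
    "Indexers",
    "Azure Search datasources",
    "Knowledge Store files",
    "Databricks tables and ingested data",
]


def plan_cleanup(mode, entities):
    """Describe the actions for a cleanup request, without executing them."""
    if mode == "Delete":
        destination, asset_state = "backup", "Archived"
    elif mode == "PrepareToReindex":
        # The next indexer job run re-registers the Assets, so no
        # indexing state is set at this point.
        destination, asset_state = "/indexlanding", None
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "remove": list(ALWAYS_REMOVED),
        "move_binaries": [(entity, destination) for entity in entities],
        "asset_indexing_state": asset_state,
    }
```

The only differences between the modes are the destination of the raw binary files and whether the Assets are archived or left to be picked up again by the indexer job.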

Allow SMTP settings different from Sendgrid in Sidra installation¶

We now support custom SMTP settings, allowing the use of an SMTP server other than the default Sendgrid account. This feature was required in order to enable certain advanced scenarios for password recovery and other requirements from some of our customers and partners. There is a new set of optional parameters in the CLI to support this at installation/upgrade time. More information is available on the Sidra documentation site.

Knowledge Store ingestion performance improvements¶

This release includes some performance improvements regarding the Knowledge Store ingestion in Sidra. Different settings involved in the different phases of the document indexing have been tested and fine-tuned for improving the performance of the document indexing process.

These settings are resource sizing, degree of parallelism, search units, as well as infrastructure deployment settings for the associated skills executed by the Azure Search skillset.

In addition, an option has been added to the Sidra CLI deployment to configure whether or not the Azure Search service will be deleted and re-created. This is an optional parameter, and it is set to false by default. This setting allows the deployment to recreate the Azure Search resource with a different tier.

Documentation improvements¶

Just as Sidra Data Platform is under continuous development, the Sidra documentation keeps improving in order to make the user experience as easy as possible. For that reason, it now includes a section dedicated to the Sidra API, showing the different API endpoints and their schemas, depending on the section of interest.

Issues fixed in Sidra 2022.R2¶

Sidra's team is constantly working on improving the product stability, and the following list includes the most relevant bugs fixed and improvements developed as part of this new release:

Changed the Sidra implementation to handle Databricks jobs to use DatabricksClient. Also, we reviewed all the type definitions for job id handling to match Databricks requirements. #137896
All Data Product templates provided as accelerators by Sidra have been extensively reviewed to ensure that key settings, especially sensitive settings such as those for Identity Server, are now stored in and retrieved from the KeyVault. #136576
Fixed a problem with the Connector version not showing correctly when two versions have the same 'ReleaseDate'. #121024
Fixed an issue when retrieving runId from Databricks, whereby a change in the jobId numeration was triggering an error in the Data Product pipelines. #137836
Fixed a problem by which, whenever the metadata pipeline found a table whose columns it was unable to infer, the process ended with a record in the Entities table without any associated Attribute in the Attributes table. The createtables notebook then failed while trying to execute the command to create the table. #137186
Solved an issue regarding the lack of creation of some KeyVault keys during the deployment. #140072
Solved an issue where the apiScope was not set to true by default with IdentityServer, thus triggering an error during deployment. #140336
Fixed an issue where the Application Insights resource in Core was not configured as connected to the Log Analytics workspace. #137807
Solved a concurrency issue when the Sidra API starts after a deployment and, at the same time, a user begins to create a new Data Intake Process from the web. #136159
Fixed an issue when recreating AAD applications from Sidra CLI execution, whereby some secrets were not updated with the new ApplicationId and password for the Identity Server service principal, causing login to app services to fail. #137927
Solved an issue about the horizontal scrollbar position of the data preview of an Entity, in the Data Catalog. #137290
Fixed a UI element alignment issue in the Authorizations delegation section when no elements are present in the list. #132071
Solved a specific issue when trying to save a value in KeyVault during Sidra installation by explicitly setting the resource group name in the grant permissions instructions. #140433
Solved an issue causing the Transfer Query script to fail when no partition attributes are defined. #135192
Fixed an issue in Oracle plugin, where the data intake failed due to duplicated columns in the extract query. #140707
Fixed an issue in CLI tool when executing New-AzADAppCredential for Automation Account. #134290
Solved an issue when executing install 'aadapplications' step in CLI tool, at the step of updating the automation account certificate, caused by a previous version of Az.Resources and Az.Automation. #137905
Solved an issue with Data Products renewing password with each deployment, by switching the method to GenerateAlphanumeric. #139401
Solved an issue where DatabaseBuilder job was failing when deploying Sidra, due to some configuration for Activity Template and Pipeline Template. #140071
Fixed an issue when executing plugins, caused by a TelemetryService exception when events are tracked, by moving this service to the common solution. #140388
Fixed an issue in the clear form function in the Sidra plugins wizard that was only clearing when switching between steps. #125987
Fixed an issue making the Transfer Query fail when idSourceItem is not defined as a partition field. #135193
Solved an issue where the cluster version was not updated, causing the CreateTable and RunTransferQueries scripts to fail. #137643
Fixed an issue with SQL and Databricks Data Product, where the field Query in Staging Configuration was not large enough. #137060
Fixed an issue in the template definition for the extraction pipeline in SQL and Databricks Data Products. #137349
Solved a problem in the Data Intake Process Configuration UI that caused some fields not to appear when they should have been activated. #141556
Fixed an issue creating types in staging tables when the source data type was numeric in Azure SQL and SQL Server. #141712
Fixed an issue to avoid returning a failure response when the status message of a Databricks job execution is empty. #141950
Fixed a problem with Identity Server redirect URIs in Sidra installations with custom naming. #142028
Solved an issue where App Insights call from Core API to Identity Server was failing because of insufficient permissions to retrieve information from Graph API. #141921
Fixed an issue where the latest version of the available plugins in Data Intake Process creation was not being installed. #141709
Fixed an issue where HubConnection Builder was not properly used, thus causing some SignalR notifications to fail. #141942
Removed the variable serviceConnectionSubscriptionName from the CLI, since it was not being used. #141884
Solved an issue with DB2 plugin when associating a new trigger to data extraction pipeline. #142624
Fixed a problem where there was a duplicate variable in Core variable group. #142489
Solved an issue with Databricks consolidation mode in SQL and Databricks Data Product, in the specific case when there is not a primary key in the table. #138804
Fixed a type issue for diagnosticSettingsLogs in ARM templates by setting to array type. #142216
Fixed an issue with Sharepoint plugin in Azure Search mode, where the needed latest versions of the Azure Search templates were not being installed correctly. #141963
Fixed an issue in the Data Products UI page while searching for Data Products. #130998
Fixed a contextual help text in Sharepoint metadata extraction Data Intake Process configuration page. #142634
Fixed an issue where the Sharepoint plugin installation was failing if no Azure Search pipeline versions are available for the Sidra installation. #143233

Breaking Changes in Sidra 2022.R2¶

Substitution of parameters on the CLI tool (Profile command)¶

Description¶

The profile command of the CLI tool now has a parameter that replaces three others in the create option, making the configuration easier for the user.

Required Action¶

No action is required. The parameters devOpsProjectUrl, devOpsProject and gitRepository have been replaced by the devOpsRepositoryUrl parameter, the URL of the target Azure DevOps repository, where code and template files are to be copied. It has the format https://dev.azure.com/organizationName/projectName/_git/repositoryName. Internally, the URL is split into three profile JSON properties. More information can be found in the profile command page.
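
As a rough sketch of the kind of split described above, the following Python function breaks a repository URL of that format into three values. The function name is hypothetical, and so are the assumptions that the resulting properties keep the names of the three replaced parameters and that devOpsProjectUrl holds the organization-level URL; the actual CLI may split the URL differently.

```python
from urllib.parse import urlsplit


def split_repository_url(url):
    """Split an Azure DevOps repository URL into three profile properties."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    # Expected path: organizationName/projectName/_git/repositoryName
    if len(segments) != 4 or segments[2] != "_git":
        raise ValueError(f"unexpected repository URL: {url}")
    organization, project, _, repository = segments
    return {
        # Assumption: the project URL is the organization-level URL.
        "devOpsProjectUrl": f"{parts.scheme}://{parts.netloc}/{organization}",
        "devOpsProject": project,
        "gitRepository": repository,
    }
```

Validating the path up front means that a URL in a different format is rejected with a clear error instead of producing a partially filled profile.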

New parameter on the CLI tool (Deploy command)¶

Description¶

The CLI tool now contains a new parameter on the command deploy, RecreateAzureSearch, in the configure option, making the configuration easier for the user.

Required Action¶

No action is required.

RecreateAzureSearch parameter¶

Description¶

The RecreateAzureSearch parameter may have been set to true in a previous execution, in addition to there having been some change in the currentAzureSearchSku.

Required Action¶

No action is required. However, please note that, when upgrading, if the new setting is set to true and a SKU change of the Azure Search service is required, this will trigger a full recreation of the service, and will require a full reindexing of the files. In this case, the existing index will be removed.

For safety reasons, this parameter is forced to false at the end of the CLI execution, to avoid unintended consequences. If the user deliberately wants to recreate Azure Search, this will have to be done manually.
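
The decision described in this section can be sketched as below. The function and its arguments are hypothetical and only model the text above. In particular, what the deployment does when the SKU changes while the flag is false is not described here, so the sketch simply keeps the existing service in that case.

```python
def plan_azure_search_deployment(recreate, current_sku, target_sku):
    """Return (action, recreate_flag_after_run) for one CLI execution."""
    sku_changed = current_sku != target_sku
    if recreate and sku_changed:
        # Full recreation: the existing index is removed and every file
        # has to be reindexed afterwards.
        action = "recreate"
    else:
        # Assumption: in any other case the existing service is kept.
        action = "keep"
    # The flag is always forced back to false at the end of the run.
    return action, False
```

Because the flag is reset after every run, a recreation only happens in an execution where the user has explicitly set it to true again.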

Coming soon...¶

This 2022.R2 release represents an important technological evolution effort to keep up to date with underlying platform improvements, namely .NET Core, Azure DevOps and Databricks.

Important steps have been taken on improving the operations of plugins and Data Product deployments.

Finally, accelerators for automatically detecting source database/tables changes have been developed, as part of the schema evolution features. Future schema evolution will incorporate changes to detect and accommodate deletions of tables/columns in source systems.

As part of the next release, we are planning to increase the plugin catalogue to create and edit Data Intake Processes from Sidra Web.

Feedback¶

We would love to hear from you! You can make a product suggestion, report an issue, ask questions, find answers and propose new features by reaching out to us in [email protected].