Sidra Data Platform (version 2022.R2: Kind Kanzi)¶
released on May 11, 2022
The main focus of this release has been the full migration of Sidra from .NET Core 3.1 to .NET 6.
However, this release also includes important functional improvements. The most significant of them is the support for schema evolution in source systems of type database. While previous versions of Sidra already detected changes to the data source table schemas and notified the user of these changes, this release brings the support for fully automated schema amendment on the DSU table, adding new Attributes and Entities when they appear in the data source.
Knowledge Store ingestion performance and operational improvements have also been included to cover different scenarios derived from binary file ingestion, making the Knowledge Store a fully supported scenario.
Besides all these changes, some other technical evolution topics have been included in this release, including the update of Databricks cluster runtime to the latest long term support (LTS) version 10.4, and the migration of all the CI/CD Azure DevOps release pipelines to support the current windows-2022 agent.
Sidra also continues to consolidate the operational improvements around Data Products deployment and plugin management.
Sidra 2022.R2 Release Overview¶
This new release was centered around the migration to .NET 6 and Databricks Runtime 10.4, both LTS releases that ensure the long-term supportability of the platform. In addition, we have released the following new features:
- .NET migration
- Windows-2022 release pipelines
- Schema evolution for added columns and tables in source
- Updated Databricks cluster configuration to 10.4 LTS
- Data Product improvements
- Support multiple Data Product pipelines per Entity
- Inclusion of Entity information in Data Products
- Plugin release management improvements
- File indexing artifacts deletion and re-indexing automated process
- Allow SMTP settings different from Sendgrid in Sidra installation
- Knowledge store ingestion performance improvements
- Documentation improvements
What's new in Sidra 2022.R2¶
One of the biggest changes included in this release has been the full migration from .NET Core 3.1 to .NET 6. This was a critical and much-needed effort, since Sidra was using .NET Core 3.1, which reaches end of life (EOL) by the end of 2022. By moving to .NET 6 we are not just reaping the benefits of the newer platform; because this new version is LTS, we are also ensuring supportability until November 2024. This migration work has impacted all Core, DSU and Data Products modules, including:
- Sidra plugins
- Sidra API
- Sidra Data Products
- Identity Server
- Balea authorization framework
- Other Sidra backend modules, e.g. backend jobs, backoffice and deployment tools and modules
Windows-2022 release pipelines¶
The Agent Pool in Sidra Azure DevOps Release pipeline has been upgraded to windows-2022, affecting both Core deployment and Data Products pipelines. You can check more information about this technology evolution in this link.
Schema evolution for added columns and tables in source¶
In this release Sidra incorporates a new feature that automatically evolves the schema of the source tables: whenever new columns are added in the source, a new Attribute is created in the Sidra Core metadata for each added column.
Following the Sidra process, new columns will be created in the respective Databricks tables for that Entity. Then, in each execution of the data extraction pipeline, Assets will be created in Databricks including these new columns. Also, whenever new tables are added in the source system, if these tables are configured to be included in the data extraction (in the metadata extraction options of the Data Intake Process configuration), new Entities will be created for these added tables and associated with the data extraction pipeline, so they can be included in the data extraction set.
For more details, you can see the related schema evolution documentation.
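The additive behavior described above can be pictured with a minimal Python sketch. It is only an illustration, not Sidra's implementation: the `Entity` class, the `evolve_schema` function and the plain dictionaries standing in for the Core metadata and the source schema are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Hypothetical stand-in for a Sidra Entity and its Attributes."""
    name: str
    attributes: list[str] = field(default_factory=list)


def evolve_schema(known: dict[str, Entity],
                  source_schema: dict[str, list[str]],
                  included_tables: set[str]) -> dict[str, list[str]]:
    """Add Entities/Attributes for tables/columns that appeared in the source.

    Returns a report mapping each touched Entity name to the Attributes added.
    Only additions are handled; removed columns or tables are left untouched.
    """
    added: dict[str, list[str]] = {}
    for table, columns in source_schema.items():
        if table not in known:
            # New tables become Entities only when configured for extraction.
            if table not in included_tables:
                continue
            known[table] = Entity(name=table)
        entity = known[table]
        new_columns = [c for c in columns if c not in entity.attributes]
        if new_columns:
            entity.attributes.extend(new_columns)
            added[table] = new_columns
    return added
```

In this sketch a table that is not configured for extraction is simply skipped, which mirrors the condition described above for newly added source tables.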
Updated Databricks cluster configuration to 10.4 LTS¶
The runtime version of the Databricks cluster for DSU resource groups has been upgraded to the latest long term support (LTS) version that Databricks has released. You can check this information in the Azure Databricks documentation.
Support multiple Data Product pipelines per Entity¶
A more robust mechanism to manage multiple Data Product pipelines per Entity has been included.
This has been done through a couple of improvements on the Data Products side:
- Sidra now supports a new PipelineSyncBehavior (the Sync webjob is what synchronizes metadata between the DSU and the Data Product). This behavior loads any Asset up to the most recent day for which Assets are available. The Asset status is tracked in the ExtractPipelineExecution table, which makes it possible to track the status of multiple pipelines per Entity. This is the mandatory Sync Behavior (PipelineSyncBehavior) for loading an Entity with several pipelines. It is also the recommended SyncBehavior for all new load pipelines.
- All the Data Product pipeline templates have been updated to keep track of and update this Client pipeline execution status. You can check this information in the Sidra Data Product concepts section and the Sidra Data Product pipelines documentation pages.
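To see why the status needs to be tracked per pipeline rather than per Asset, consider the following minimal Python sketch. It is an assumption-laden illustration, not the Sync webjob's code: `pending_assets` is a hypothetical function, and the set of `(pipeline_id, asset_id)` pairs is only a stand-in for rows of the ExtractPipelineExecution table, whose real columns are not described here.

```python
from datetime import date


def pending_assets(assets, executions, pipeline_id):
    """Return the ids of the Assets that `pipeline_id` still has to load.

    assets: list of (asset_id, asset_date) tuples available in the DSU.
    executions: set of (pipeline_id, asset_id) pairs already loaded; a
        hypothetical stand-in for the ExtractPipelineExecution table.
    """
    ordered = sorted(assets, key=lambda a: (a[1], a[0]))  # oldest first
    return [
        asset_id
        for asset_id, _ in ordered
        if (pipeline_id, asset_id) not in executions
    ]


assets = [(1, date(2022, 5, 1)), (2, date(2022, 5, 2)), (3, date(2022, 5, 3))]
executions = {("A", 1), ("A", 2), ("B", 1)}
print(pending_assets(assets, executions, "A"))
print(pending_assets(assets, executions, "B"))
```

Because the key includes the pipeline, two pipelines attached to the same Entity can be at different points of the load without interfering with each other.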
Data Product improvements¶
This feature consists of a few enhancements on the Data Product side:
- A condition to the orchestrator activity in the Data Product load pipelines has been introduced, to manage when the stored procedure is empty, or creates a new template without the orchestrator.
- All Sidra-owned Data Product pipeline templates have been reviewed from a security standpoint, to ensure all sensitive settings are handled by the KeyVault resource inside the Data Product.
- A new consolidation mode, overwrite, has been added for overwriting data in Data Products. To enable this mode, a new Data Product pipeline parameter called PipelineExecutionProperties has been added. In the default consolidation mode, merge, the data will be merged if there is a Primary Key; otherwise the data will be appended. If the consolidation mode is overwrite, the entire table will always be overwritten.
More information about this feature can be checked in the documentation site.
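The difference between the two consolidation modes can be sketched in a few lines of Python. This is an illustrative model only: the `consolidate` function and the representation of tables as lists of dictionaries are assumptions, and it does not show how the mode is actually passed through PipelineExecutionProperties.

```python
def consolidate(target, incoming, mode="merge", primary_key=None):
    """Return the consolidated rows of a table (rows are plain dicts).

    mode="overwrite": the target is replaced entirely by the incoming rows.
    mode="merge": rows are upserted by primary key when one is defined,
    otherwise the incoming rows are appended.
    """
    if mode == "overwrite":
        return list(incoming)
    if mode != "merge":
        raise ValueError(f"unknown consolidation mode: {mode!r}")
    if primary_key is None:
        # No primary key: nothing to match on, so append.
        return list(target) + list(incoming)
    merged = {row[primary_key]: row for row in target}
    for row in incoming:
        merged[row[primary_key]] = row  # update existing key or add new one
    return list(merged.values())
```

With a primary key, a row whose key already exists replaces the old row; without one, the same call simply grows the table, which is why overwrite is useful for full reloads.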
Inclusion of Entity information in Data Products¶
In this new version of Sidra, information from the Asset table in Sidra Core, such as row count (Entities), errors (Validation Errors) and byte size (ByteSize), has been included in Data Products, in order to keep detailed tracking of the whole process from the Data Product side. These additional fields make it possible to configure advanced logic for use cases on the Data Product side.
Plugin release management improvements¶
We have refactored the plugins code to move each plugin to its own release branch, independent of the Sidra release and of other plugin releases. Additionally, we have added a new registration endpoint for Sidra plugins, to ease the compatibility setup between each new plugin version and the Sidra version.
File indexing artifacts deletion and re-indexing automated process¶
This feature complements the current Knowledge Store ingestion capability in Sidra, to support scenarios where we need a semi-automated process to remove artifacts from the index, or to prepare the environment for re-indexing the documents, preventing previously ingested documents from affecting the index. This scenario could happen whenever there is a change in the index structure, an update of the skill model, etc.
A new Core API endpoint has been developed that orchestrates the different actions required to prepare all the moving pieces of Sidra binary file ingestion, removing old documents from the document index as well as intermediate and final data ingestion artifacts (raw storage, index, Knowledge Store, Databricks tables).
This endpoint supports two basic modes. In both modes, the intermediate structures and Asset state for the given Assets ingested by a specific pipeline are removed from Core. The artifacts that are always removed are the following:
- Azure Search datasources
- Knowledge Store files
- Databricks tables and ingested data
With regards to what to do with the raw documents in the intermediate raw storage container, two different modes have been defined for this endpoint:
- Mode Delete: Moves all Asset binary files from all Entities associated with the Azure Search pipeline to the backup container, and sets the Asset indexing state to Archived.
- Mode PrepareToReindex: Moves all Asset binary files from the raw container of each of the Entities associated with the pipeline to the /indexlanding container. In this case, the next execution of the indexer job will re-register these Assets and reindex them in batch.
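The two modes share the same cleanup and differ only in where the raw files go. The following Python sketch models that as a plan of steps; it does not call the Core API, and the `Mode` enum, `plan_cleanup` function and the tuple-based step format are hypothetical names chosen for illustration.

```python
from enum import Enum


class Mode(Enum):
    DELETE = "Delete"
    PREPARE_TO_REINDEX = "PrepareToReindex"


# Artifacts removed in both modes, as listed above.
ALWAYS_REMOVED = (
    "Azure Search datasources",
    "Knowledge Store files",
    "Databricks tables and ingested data",
)


def plan_cleanup(mode, asset_files):
    """Build the ordered list of (action, detail) steps for the given mode."""
    steps = [("remove", artifact) for artifact in ALWAYS_REMOVED]
    if mode is Mode.DELETE:
        # Raw files are kept in a backup container and Assets are archived.
        steps += [("move", f"{name} -> backup") for name in asset_files]
        steps.append(("set_state", "Archived"))
    elif mode is Mode.PREPARE_TO_REINDEX:
        # Raw files go back to the landing area so the indexer picks them up.
        steps += [("move", f"{name} -> /indexlanding") for name in asset_files]
    else:
        raise ValueError(f"unsupported mode: {mode!r}")
    return steps
```

Calling `plan_cleanup(Mode.PREPARE_TO_REINDEX, ["contract.pdf"])` in this sketch yields the three common removals followed by a single move to /indexlanding, with no state change, since the indexer job is expected to re-register the Asset.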
Allow SMTP settings different from Sendgrid in Sidra installation¶
We now support custom SMTP settings, allowing the use of an SMTP server other than the default Sendgrid account. This feature was required in order to enable certain advanced scenarios for password recovery and other requirements from some of our customers and partners. There is a new set of optional parameters in the CLI to support this at installation/upgrade time. More information is detailed in the Sidra documentation site.
Knowledge Store ingestion performance improvements¶
This release includes some performance improvements regarding the Knowledge Store ingestion in Sidra. Different settings involved in the different phases of the document indexing have been tested and fine-tuned for improving the performance of the document indexing process.
These settings are resource sizing, degree of parallelism, search units, as well as infrastructure deployment settings for the associated skills executed by the Azure Search skillset.
In addition, an option has been added to the Sidra CLI deployment to configure whether the Azure Search service will be deleted and re-created. This is an optional parameter, and it is set to false by default. This setting allows the deployment to recreate the Azure Search resource with a different tier.
Documentation improvements¶
Just as the Sidra Data Platform is under continuous development, the Sidra documentation keeps improving in order to make the user experience as easy as possible. For that reason, it now includes a section specifically dedicated to the Sidra API, showing the different API endpoints and their schemas for each section of interest.
Issues fixed in Sidra 2022.R2¶
Sidra's team is constantly working on improving the product stability, and the following list includes the most relevant bugs fixed and improvements developed as part of this new release:
- Changed the Sidra implementation to handle Databricks jobs to use DatabricksClient. Also, we reviewed all the type definitions for job id handling to match Databricks requirements. #137896
- All Data Product templates provided as accelerators by Sidra have been extensively reviewed to ensure that key settings, especially sensitive settings such as those for Identity Server, are now stored in and retrieved from the KeyVault. #136576
- Fixed a problem with the Connector version not showing correctly when two versions have the same 'ReleaseDate'. #121024
- Fixed an issue when retrieving runId from Databricks, whereby a change in the jobId numeration was triggering an error in the Data Product pipelines. #137836
- Fixed a problem by which, whenever the metadata pipeline gets a table whose columns it is unable to infer, the process ends up with a record in the Entities table without any associated Attribute in the Attributes table. The createtables notebook then fails while trying to execute the command to create the table. #137186
- Solved an issue regarding the lack of creation of some KeyVault keys during the deployment. #140072
- Solved an issue where the apiScope was not set to true by default with IdentityServer, thus triggering an error during deployment. #140336
- Fixed an issue where the Application Insights resource in Core was not configured as connected to the Log Analytics workspace. #137807
- Solved a concurrency issue when the Sidra API starts after a deployment and, at the same time, a user begins to create a new Data Intake Process from the web. #136159
- Fixed an issue when recreating AAD applications from Sidra CLI execution, whereby some secrets were not updated with the new ApplicationId and password for the Identity Server service principal, causing login into app services to fail. #137927
- Solved an issue about the horizontal scrollbar position of the data preview of an Entity, in the Data Catalog. #137290
- Fixed a UI element alignment issue in the Authorizations delegation section when no elements are present in the list. #132071
- Solved a specific issue when trying to save a value in KeyVault during Sidra installation by explicitly setting the resource group name in the grant permissions instructions. #140433
- Solved an issue causing the Transfer Query script to fail when no partition attributes are defined. #135192
- Fixed an issue in Oracle plugin, where the data intake failed due to duplicated columns in the extract query. #140707
- Fixed an issue in CLI tool when executing New-AzADAppCredential for Automation Account. #134290
- Solved an issue when executing install 'aadapplications' step in CLI tool, at the step of updating the automation account certificate, caused by a previous version of
- Solved an issue with Data Products renewing password with each deployment, by switching the method to
- Solved an issue where DatabaseBuilder job was failing when deploying Sidra, due to some configuration for Activity Template and Pipeline Template. #140071
- Fixed an issue when executing plugins due to some TelemetryService exception when events are tracked, by taking this service to common solution. #140388
- Fixed an issue in the clear form function in the Sidra plugins wizard that was only clearing when switching between steps. #125987
- Fixed an issue making the Transfer Query fail when idSourceItem is not defined as a partition field. #135193
- Solved an issue where the cluster version was not updated, causing the CreateTable and RunTransferQueries scripts to fail. #137643
- Fixed an issue with SQL and Databricks Data Product, where the field Query in Staging Configuration did not have enough size. #137060
- Fixed an issue in the template definition for the extraction pipeline in SQL and Databricks Data Products. #137349
- Solved a problem in the Data Intake Process Configuration UI that caused some fields not to appear when they should have been activated. #141556
- Fixed an issue creating types in staging tables when the source data type was numeric in Azure SQL and SQL Server. #141712
- Fixed an issue to avoid returning a failure response when the status message of a Databricks job execution is empty. #141950
- Fixed a problem with Identity Server redirect URIs in Sidra installations with custom naming. #142028
- Solved an issue where App Insights call from Core API to Identity Server was failing because of insufficient permissions to retrieve information from Graph API. #141921
- Fixed an issue where the latest version of the available plugins in Data Intake Process creation was not being installed. #141709
- Fixed an issue where HubConnection Builder was not properly used, thus causing some SignalR notifications to fail. #141942
- Solved an issue where variable serviceConnectionSubscriptionName from CLI was removed because it was not being used. #141884
- Solved an issue with DB2 plugin when associating a new trigger to data extraction pipeline. #142624
- Fixed a problem where there was a duplicate variable in Core variable group. #142489
- Solved an issue with Databricks consolidation mode in SQL and Databricks Data Product, in the specific case when there is not a primary key in the table. #138804
- Fixed a type issue for diagnosticSettingsLogs in ARM templates by setting it to array type. #142216
- Fixed an issue with Sharepoint plugin in Azure Search mode, where the needed latest versions of the Azure Search templates were not being installed correctly. #141963
- Fixed an issue in the Data Products UI page while searching for Data Products. #130998
- Fixed a contextual help text in Sharepoint metadata extraction Data Intake Process configuration page. #142634
- Fixed an issue where the Sharepoint plugin installation was failing if no Azure Search pipeline versions are available for the Sidra installation. #143233
Breaking Changes in Sidra 2022.R2¶
Substitution of parameters on the CLI tool (Profile command)¶
The CLI tool contains a parameter on the command profile that replaces three others in the create option, making the configuration easier for the user.
No action is required. The parameters gitRepository have been replaced by devOpsRepositoryUrl parameter, the URL of the target Azure DevOps repository, where code and template files are to be copied. It comes in the format like https://dev.azure.com/organizationName/projectName/_git/repositoryName. This way, internally the URL will be split into three profile JSON properties. More information can be found in the
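As an illustration of how a URL in the format above can be split into three values, here is a minimal Python sketch. It is not the CLI's code: the function name and the dictionary keys `organization`, `project` and `repository` are assumptions, since the actual names of the three profile JSON properties are not given here.

```python
from urllib.parse import unquote, urlparse


def split_devops_url(url):
    """Split an Azure DevOps Git URL into organization, project and repository.

    Expected path layout: /organizationName/projectName/_git/repositoryName
    """
    parts = [unquote(p) for p in urlparse(url).path.split("/") if p]
    if len(parts) != 4 or parts[2] != "_git":
        raise ValueError(f"unexpected Azure DevOps repository URL: {url}")
    organization, project, _, repository = parts
    # Key names are hypothetical; the real profile property names may differ.
    return {
        "organization": organization,
        "project": project,
        "repository": repository,
    }


print(split_devops_url(
    "https://dev.azure.com/organizationName/projectName/_git/repositoryName"
))
```

Validating the `_git` segment up front makes a malformed URL fail at profile creation time rather than later in the deployment.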
New parameter on the CLI tool (Deploy command)¶
The CLI tool now contains a new parameter, RecreateAzureSearch, in the configure option of the deploy command, making the configuration easier for the user.
No action is required.
The RecreateAzureSearch parameter may still be set to true from a previous execution, in addition to some change having been made to the currentAzureSearchSku.
No action is required. However, please note that, when upgrading, if the new setting is set to true and a SKU change of the Azure Search service is required, this will trigger a full recreation of the service and will require a full reindexing of the files. In this case, the existing index will be removed.
For safety, we force this parameter to false at the end of the CLI execution, to avoid unintended consequences. If the user deliberately wants to recreate Azure Search, this will have to be done manually.
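The interaction between the flag, the SKU change and the automatic reset can be summarized in a small Python sketch. It models only the decision described above, under stated assumptions: the function name and the `AzureSearchSku` settings key are hypothetical, and the real CLI may evaluate additional conditions.

```python
def plan_search_deployment(settings, current_sku):
    """Decide whether Azure Search is recreated and return the updated settings.

    settings: dict with 'RecreateAzureSearch' (bool, optional) and the
        hypothetical key 'AzureSearchSku' holding the requested tier.
    current_sku: tier of the Azure Search service that is currently deployed.
    """
    sku_changed = settings["AzureSearchSku"] != current_sku
    # Recreation (and the loss of the existing index) happens only when the
    # flag is explicitly enabled AND the tier actually changes.
    recreate = bool(settings.get("RecreateAzureSearch", False)) and sku_changed
    # The flag is always reset at the end of the execution, so a later run
    # cannot recreate the service by accident.
    updated = dict(settings, RecreateAzureSearch=False)
    return recreate, updated
```

Returning a copy of the settings with the flag cleared, instead of mutating the input, keeps the sketch easy to test and makes the reset explicit.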
This 2022.R2 release represents an important technological evolution effort to keep up to date with underlying platform improvements, namely .NET Core, Azure DevOps and Databricks.
Important steps have been taken on improving the operations of plugins and Data Product deployments.
Finally, accelerators for automatically detecting source database/tables changes have been developed, as part of the schema evolution features. Future schema evolution will incorporate changes to detect and accommodate deletions of tables/columns in source systems.
As part of the next release, we are planning to increase the plugin catalogue to create and edit Data Intake Processes from Sidra Web.
We would love to hear from you! You can make a product suggestion, report an issue, ask questions, find answers and propose new features at our Sidra ideas portal, or by reaching out to us in [email protected].