About the Azure SQL database connector¶
The Azure SQL database connector for Sidra enables seamless integration with the most widely used SQL Server database as a Service on Azure; an intelligent, scalable, relational database service built in for the cloud. Azure SQL Database uses the latest SQL Server capabilities, and you can learn more about it on Microsoft Documentation.
Sidra connector for Azure SQL database extracts data from any table and view in the source database and loads it into the specified Data Storage Unit at regular intervals. It relies on the Sidra Metadata model for mapping source data structures to Sidra as destination, and uses Azure Data Factory as underlying data integration mechanism within Sidra.
When configuring and executing this connector, several underlying steps are involved to achieve the following:
- The necessary metadata and data governance structures are created and populated in Sidra.
- The actual data integration infrastructure (ADF Pipeline) is created, configured and deployed.
The connector is configured in less than five-minutes. Once the settings are configured and the deployment process is started, the actual duration of the data ingestion may vary from few minutes to few hours, depending on the data volumes.
After starting the connector creation process, users will receive a message that the process has started and will continue in the background. Users will be able to navigate through Sidra Web as usual while this process happens.
Once the whole deployment process is finished, users will receive a notification in Sidra Web Notifications widget. If this process went successfully, the new data structures (new Entities) will appear in the Data Catalog automatically, and the data intake process will incorporate this new data source.
Supported data synchronization mechanisms¶
The Azure SQL Database connector for Sidra supports different modes of data synchronization, which also depend on the mechanisms configured on the source system or Sidra:
Full load data synchronization: Generally performed for first time loads. This is also the default mode if no Change Tracking is enabled in the source system, nor alternative incremental load mechanism is defined. By default, the first load will be a complete load.
Incremental load data synchronization: This data synchronization mechanism captures updates for any new or modified data from the source database. Only data corresponding to the updates since the last synchronization is retrieved.
For incremental load to work, there must be a defined mechanism to capture updates in the source system. For incremental load data synchronization, two possible types of mechanisms are supported:
Incremental Load with built-in SQL Server Change Tracking (CT): This is achieved by directly activating Change Tracking in the source database.
Incremental Load non-Change Tracking related (non-CT): This is achieved by specific configurations in the Sidra Metadata tables.
NOTE: for configuring incremental load non-change tracking related, this configuration needs to be manually added to Sidra metadata as described below. This configuration is not part of the scope of the connectors wizard.
You can see more details about how this connector works and a setup guide in the documentation.
More information about these load mechanisms is included in further sections.
Steps included in the Azure SQL Database connector configuration¶
The process of setting up an Azure SQL Databse connector involves several key actions in the background once the needed configuration data has been collected from the user.
The different sections of the input form are organized into these main sections:
- Configure Provider
- Configure Data Source
- Configure Metadata Extractor options
- Configure Trigger
Section 1. Configure Provider¶
Please see on the commmon connectors page about the parameters needed to create a new Provider.
Section 2. Configure Data Source¶
The data source represents the connection to the source database. A Data Source abstracts the details of creating a Linked Service in Azure Data Factory. The fields required in this section are the Integration Runtime and the connection string to the database. Sidra Azure SQL Database connector will register this new data source in Sidra Metadata and deploy a Linked Service in Azure Data factory with this connection. The created Linked Service will use Key Vault in Sidra Core to store the connection string to the database. Default value for Integration Runtime is AutoResolveIntegrationRuntime.
For more details on Linked Services check the Data Factory documentation.
Section 3. Configure SQL Metadata Extraction¶
Sidra Azure SQL Database connector replicates the schema and tables from the source database, by querying the INFORMATION_SCHEMA System View tables.
The extracted schema is replicated to the Sidra Metadata model hierarchy. Tables in the SQL Database source system are replicated in Sidra as Entities, and columns of these tables, as Attributes.
The Entity table in Sidra metadata model contains data about the format of the Entity generated, as well as information about how this Entity should be handled by the system. Attributes contain data about the columns of the tables. Sidra adds additional system-level attributes to convey system-handling information at the level of the Entity (or table).
As part of the metadata extraction of a connector setup, some coding structures are also auto generated, such as the transfer query scripts that transform the data in raw storage into the optimized format in the Data Lake. For more details, please check the Data Ingestion documentation.
The metadata extraction represents in which way the origin will be consumed in order to obtain its metadata. The information fields required to fill in the SQL Metadata extraction section will be used for three main purposes internally in Sidra:
- Create and populate the Sidra Core metadata structures (Entities and Attributes).
- Create, deploy and execute an Azure Data Factory pipeline for the metadata extraction steps (see below steps).
- Auto-generate the create tables and transfer query scripts for each Entity, including all the steps for storing the optimized data in the data lake storage.
These are the metadata extractor configuration parameters to be input for this version of connector. All of these parameters are optional:
- Number of tables per batch: this is used internally in the metadata extractor pipeline in order to specify the number of tables that will be consumed per batch.
- Types of objects to load: views, tables or both.
- Objects to exclude: this is a list with comma separated values, which includes those objects that we do not want to load into the system. Entity metadata for these objects will therefore not be created.
Once the metadata extractor pipeline has been created, it is executed. This execution could take a while (from 5 to 15 minutes) depending on the load of Data Factory and the status and load of the associated Databricks cluster. Sidra exposes API endpoints as well to manually execute the SQL Metadata extractor pipeline.
Section 4. Configure Trigger¶
Please see on the commmon connectors page about the parameters needed to set up a trigger.
SQL data types and transformations¶
Some of the original SQL Server data types are not supported by Azure Data Factory conversions to Parquet (the format of storage in Sidra DSU).
For those data types that are not supported, Sidra connectors as plugins incorporate some type translation mechanisms in place, used both at metadata extraction and at data extraction phases.
If a certain data type is not supported, that data type is automatically changed to the closest supported type as defined in a Type Translations table.
You can find more information about the general process for type translations for plugins is in this page.
How to exclude source data¶
Sidra data connector for Azure SQL Database allows to control different settings on which tables will be part of the data to be synchronized between source and destination:
List of tables to include: this field allows to specify whether only original source tables, views or both will be part of the data synchronization. Possible values are table and view. These values correspond to the table_type values in INFORMATION_SCHEMA.TABLES SQL Server System View. If not specified, all table types will be included.
List of tables to exclude: this field allows to specify certain schemas and tables in the source database that will not be part of the data synchronization. You can check the list of schemas and tables in the INFORMATION_SCHEMA.TABLES SQL Server System View. If no values are specified in this field, no tables will be excluded. To fill in this field, add a list of [
].[ ] elements, separated by comma.
Examples: [dbo].[VariantTable], [Sales_LT].[SalesOrderHeader].
Data Extraction pipeline¶
Once with all the information provided in the above steps, Sidra Core will create and deploy the actual data extractor pipeline. The data extraction pipeline is where the actual movement and transformation of data happens:
- On one hand, the copy data ADF activities are executed, which actually move the data between the source (Azure SQL Database) and the destination (Azure Data Lake Gen2).
- On the other hand, the transfer query scripts are executed for each Entity in order to perform data optimization, data validation, etc and loading the data in its optimized format in the data lake.
The time to execute this pipeline is variable depending on the volumne of data and the environment.
Initial full data synchronization¶
Once Sidra is connected to the source database, Sidra Azure SQL Database connector first copies all rows from every table in every schema and table that has not been explicitly set to be excluded (see above on Excluding Source Data).
For each table (Entity), rows are copied by performing a SELECT satement. Copy Activity in Azure Data Factory parallelizes reads and writes according to the source and destination. A good practice when using Change Tracking, is that the field by which the changes are calculated is indexed in the source database. This avoids needing to read the whole table.
Loading incremental data mechanisms¶
Once an initial synchronization is complete, Sidra performs incremental synchronizations on the new and modified data in the source system.
Sidra Connector for Azure SQL Database uses the following change tracking mechanisms for incremental updates:
- Built-in Change Tracking (CT):
Change Tracking records when a row in a table has changed. It does not capture the data that was changed or how many times it changed. Change Tracking requires primary keys defined in the source database, in order to identify which rows have changed.
For more information on how to activate Change Tracking in an SQL database, please check the Microsoft Documentation.
If CT is enabled on a table in the source system, Sidra core will use CT as the incremental update mechanism. This is the most efficient mechanism for detecting new, deleted or updated entries in the source database.
Change Tracking creates change records that the data extractor pipeline in Sidra Core accesses on a per-table (Entity) basis during incremental updates.
When Change Tracking is enabled, the data extractor pipeline will load data depending on the last value of LastDeltaValue. LastDeltaValue is a value stored in the EntityDeltaLoad table in Sidra Core metadata.
If there is no LastDelta Value, the load will be marked as Initial Load, and if there are new values from the latest LastDeltaValue, the load will be marked as Incremental Load.
Additionally, Sidra Core incorporates an option for signalling when to reload all the changes (CHANGE_TRACKING_MIN_VALID_VERSION) in SQL to control when Change Tracking can be used to load the changes, and when we need to overwrite the entire table. This is useful for advanced controlling scenarios of expiration of data retention period.
If a table does not have any changes at scheduled data extraction time, an empty asset (0 bytes) in Sidra Core database is generated in order to specify that the scheduled load was successful but generated an empty set. This is needed to ensure consistency between Sidra Core and Client Applications.
The current version of Sidra connector does not support advanced schema evolution. This means that if new columns are added in the source table, the metadata for this table in Sidra would need to be explicitly modified outside of this connector. Sidra API supports metadata extractor methods for SQL type of sources.
When rows are deleted in the source database, Sidra does not delete rows from the destination storage. A logical delete happens in Sidra instead. Logical deletion is supported in Sidra through a metadata field called Sidra_IsRemoved.
Sidra supports delete operations in Change Tracking incremental load. The delete support has been incorporated to Transfer Query scripts
- Non-Change Tracking mechanism enabled by Sidra:
Besides the SQL Server built-in Change Tracking, Sidra SQL database connector also supports an option to use an incremental query based in parametrization.
This parametrization is provided by the DeltaAttribute, which is the Attribute that is used for the incremental extraction. This Attribute acts as a unique key to identify new inserts, deletes or updates in the source database. Sidra generates a query that filters by searching the differences according to this column. This mechanism is not as efficient as the built-in Change Tracking mechanism.
This Attribute is described in the EntityDeltaLoad table in Sidra Core.
- Non-Change Tracking custom mechanism enabled by Sidra:
If the DeltaAttribute is not present in the Entity, Sidra Core uses two additional columns, also in the EntityDeltaLoad table in Sidra Core.
These columns act as a unique key to identify new inserts, deletes or updates in the source database. Sidra generates a query that filters by searching the differences according to these columns. This mechanism is not as efficient as the built-in Change Tracking mechanism.
CustomExtractMaxValueExpression: This value contains the custom expression used to get the maximum value for incremental load.
CustomExtractDataExpression: This value contains the custom expression used to query the source table to get the incremental data. This allows to code more complex expressions than just using a DeltaAttribute.