About the Microsoft SQL Server database connector

The SQL Server connector for Sidra enables seamless integration with Microsoft powerful enterprise relational database.

Sidra connector for SQL Server extracts data from any table and view in the source database and loads it into the specified Data Storage Unit at regular intervals. It relies on the Sidra Metadata model for mapping source data structures to Sidra as destination, and uses Azure Data Factory as underlying data integration mechanism within Sidra.

When configuring and executing this connector, several underlying steps are involved to achieve the following:

  • The necessary metadata and data governance structures are created and populated in Sidra.
  • The actual data integration infrastructure (ADF Pipeline) is created, configured and deployed.

The connector is configured in less than five-minutes. Once the settings are configured and the deployment process is started, the actual duration of the data ingestion may vary from few minutes to few hours, depending on the data volumes.

After starting the connector creation process, users will receive a message that the process has started and will continue in the background. Users will be able to navigate through Sidra Web as usual while this process happens.

Once the whole deployment process is finished, users will receive a notification in Sidra Web Notifications widget. If this process went successfully, the new data structures (new Entities) will appear in the Data Catalog automatically, and the data intake process will incorporate this new data source.

Supported SQL Sever versions

The following list includes all SQL Server versions supported by this connector:

  • SQL Server 2012 (version 11.xx)
  • SQL Server 2014 (version 12.xx)
  • SQL Server 2016 (version 13.xx)
  • SQL Server 2017 (version 14.xx)
  • SQL Server 2019 (version 15.xx)

Note: All editions (Developer, Standard and Enterprise) are supported, but some features of the connector will only be available if the source SQL Server edition supports the feature, such as the Enterprise edition requirement for Change Tracking on tables.

Supported SQL Server data synchronization mechanisms

The SQL Server connector supports different modes of data synchronization, which also depend on the mechanisms configured on the source system or Sidra and the source system edition of SQL Server:

  • Full load data synchronization: Generally performed for first time loads. This is also the default mode if no Change Tracking is enabled in the source system, nor alternative incremental load mechanism is defined. By default, the first load will be a complete load.

  • Incremental load data synchronization: This data synchronization mechanism captures updates for any new or modified data from the source database. Only data corresponding to the updates since the last synchronization is retrieved.

For incremental load to work, there must be a defined mechanism to capture updates in the source system. For incremental load data synchronization, two possible types of mechanisms are supported:

  • Incremental Load with built-in SQL Server Change Tracking (CT): This is achieved by directly activating Change Tracking in the source database.

  • Incremental Load non-Change Tracking related (non-CT): This is achieved by specific configurations in the Sidra Metadata tables.

NOTE: for configuring incremental load non-change tracking related, this configuration needs to be manually added to Sidra metadata as described below. This configuration is not part of the scope of the connectors wizard.

More information about these load mechanisms is included in further sections.

Steps included in the SQL Server connector configuration

The process of setting up an SQL Server connector involves several key actions in the background once the needed configuration data has been collected from the user.

The different sections of the input form are organized into these main sections:

  • Configure Provider
  • Configure Data Source
  • Configure Metadata Extractor options
  • Configure Trigger

Section 1. Configure Provider

Please see on the commmon connectors page about the parameters needed to create a new Provider.

Section 2. Configure Data Source

The data source represents the connection to the source database. A Data Source abstracts the details of creating a Linked Service in Azure Data Factory. The fields required in this section are the Integration Runtime and the connection string to the database. Sidra SQL Server connector will register this new data source in Sidra Metadata and deploy a Linked Service in Azure Data factory with this connection. The created Linked Service will use Key Vault in Sidra Core to store the connection string to the database. Default value for Integration Runtime is AutoResolveIntegrationRuntime.

For more details on Linked Services check the Data Factory documentation.

The information that needs to be provided is the following:

  • Integration Runtime: this is the ADF integration runtime in charge of executing the connection between the data origin, and is associated with a deployed Linked Service. In case it is not required any IR, the option Default will be selected.
  • Server Name: the name of the server to connect to.
  • Database Name: the data catalogue in the server
  • Username: user with access to the origin server
  • Password: user password
  • Encrypt: for the connection string (true/false). By default, true is used.
  • Trust Server Certificate: for the connection string (true/false). By default, false is used.

Section 3. Configure SQL Metadata Extraction

Sidra Azure SQL Database connector replicates the schema and tables from the source database, by querying the INFORMATION_SCHEMA System View tables.

The extracted schema is replicated to the Sidra Metadata model hierarchy. Tables in the SQL Database source system are replicated in Sidra as Entities, and columns of these tables, as Attributes.

The Entity table in Sidra metadata model contains data about the format of the Entity generated, as well as information about how this Entity should be handled by the system. Attributes contain data about the columns of the tables. Sidra adds additional system-level attributes to convey system-handling information at the level of the Entity (or table).

As part of the metadata extraction of a connector setup, some coding structures are also auto generated, such as the transfer query scripts that transform the data in raw storage into the optimized format in the Data Lake. For more details, please check the Data Ingestion documentation.

The metadata extraction represents in which way the origin will be consumed in order to obtain its metadata. The information fields required to fill in the SQL Metadata extraction section will be used for three main purposes internally in Sidra:

  • Create and populate the Sidra Core metadata structures (Entities and Attributes).
  • Create, deploy and execute an Azure Data Factory pipeline for the metadata extraction steps (see below steps).
  • Auto-generate the create tables and transfer query scripts for each Entity, including all the steps for storing the optimized data in the data lake storage.

These are the metadata extractor configuration parameters to be input for this version of connector. All of these parameters are optional:

  • Number of tables per batch: this is used internally in the metadata extractor pipeline in order to specify the number of tables that will be consumed per batch.
  • Types of objects to load: views, tables or both.
  • Objects to exclude: this is a list with comma separated values, which includes those objects that we do not want to load into the system. Entity metadata for these objects will therefore not be created.

Once the metadata extractor pipeline has been created, it is executed. This execution could take a while (from 5 to 15 minutes) depending on the load of Data Factory and the status and load of the associated Databricks cluster. Sidra exposes API endpoints as well to manually execute the SQL Metadata extractor pipeline.

Section 4. Configure Trigger

Please see on the commmon connectors page about the parameters needed to set up a trigger.

SQL data types and transformations

Some of the original SQL Server data types are not supported by Azure Data Factory conversions to Parquet (the format of storage in Sidra DSU).

For those data types that are not supported, Sidra connectors as plugins incorporate some type translation mechanisms in place, used both at metadata extraction and at data extraction phases.

If a certain data type is not supported, that data type is automatically changed to the closest supported type as defined in a Type Translations table.

You can find more information about the general process for type translations for plugins is in this page.

How to exclude source data

Sidra data connector for SQL Server allows to control different settings on which tables will be part of the data to be synchronized between source and destination:

  • List of tables to include: this field allows to specify whether only original source tables, views or both will be part of the data synchronization. Possible values are table and view. These values correspond to the table_type values in INFORMATION_SCHEMA.TABLES SQL Server System View. If not specified, all table types will be included.

  • List of tables to exclude: this field allows to specify certain schemas and tables in the source database that will not be part of the data synchronization. You can check the list of schemas and tables in the INFORMATION_SCHEMA.TABLES SQL Server System View. If no values are specified in this field, no tables will be excluded. To fill in this field, add a list of [].[] elements, separated by comma.

Examples: [dbo].[VariantTable], [Sales_LT].[SalesOrderHeader].

Data Extraction pipeline

Once with all the information provided in the above steps, Sidra Core will create and deploy the actual data extractor pipeline. The data extraction pipeline is where the actual movement and transformation of data happens:

  • On one hand, the copy data ADF activities are executed, which actually move the data between the source (SQL Server database) and the destination (Azure Data Lake Gen2).
  • On the other hand, the transfer query scripts are executed for each Entity in order to perform data optimization, data validation, etc and loading the data in its optimized format in the data lake.

The time to execute this pipeline is variable depending on the volumne of data and the environment.

Initial full data synchronization

Once Sidra is connected to the source database, Sidra SQL Server connector first copies all rows from every table in every schema and table that has not been explicitly set to be excluded (see above on Excluding Source Data).

For each table (Entity), rows are copied by performing a SELECT satement. Copy Activity in Azure Data Factory parallelizes reads and writes according to the source and destination. A good practice when using Change Tracking, is that the field by which the changes are calculated is indexed in the source database. This avoids needing to read the whole table.

Loading incremental data mechanisms

Once an initial synchronization is complete, Sidra performs incremental synchronizations on the new and modified data in the source system.

Sidra Connector for SQL Server database uses the following change tracking mechanisms for incremental updates:

  • Built-in Change Tracking (CT):

Change Tracking records when a row in a table has changed. It does not capture the data that was changed or how many times it changed. Change Tracking requires primary keys defined in the source database, in order to identify which rows have changed.

For more information on how to activate Change Tracking in an SQL database, please check the Microsoft Documentation.

If CT is enabled on a table in the source system, Sidra core will use CT as the incremental update mechanism. This is the most efficient mechanism for detecting new, deleted or updated entries in the source database.

Change Tracking creates change records that the data extractor pipeline in Sidra Core accesses on a per-table (Entity) basis during incremental updates.

When Change Tracking is enabled, the data extractor pipeline will load data depending on the last value of LastDeltaValue. LastDeltaValue is a value stored in the EntityDeltaLoad table in Sidra Core metadata. If there is no LastDelta Value, the load will be marked as Initial Load, and if there are new values from the latest LastDeltaValue, the load will be marked as Incremental Load.

Additionally, Sidra Core incorporates an option for signalling when to reload all the changes (CHANGE_TRACKING_MIN_VALID_VERSION) in SQL to control when Change Tracking can be used to load the changes, and when we need to overwrite the entire table. This is useful for advanced controlling scenarios of expiration of data retention period.

If a table does not have any changes at scheduled data extraction time, an empty asset (0 bytes) in Sidra Core database is generated in order to specify that the scheduled load was successful but generated an empty set. This is needed to ensure consistency between Sidra Core and Client Applications.

The current version of Sidra connector does not support advanced schema evolution. This means that if new columns are added in the source table, the metadata for this table in Sidra would need to be explicitly modified outside of this connector. Sidra API supports metadata extractor methods for SQL type of sources.

Deleted Rows

When rows are deleted in the source database, Sidra does not delete rows from the destination storage. A logical delete happens in Sidra instead. Logical deletion is supported in Sidra through a metadata field called Sidra_IsRemoved.

Sidra supports delete operations in Change Tracking incremental load. The delete support has been incorporated to Transfer Query scripts.

Change Tracking expiration

Each table using Change Tracking has a minimum valid version. This number represents the version of the database at the moment when the first trackable change in this table was done.

This valid version determines an expiration period of Change Tracking, so that any change older than the minimum valid version cannot be recovered using Change Tracking mechanism.

The minimum version depends on the retention period (by default retention period is 2 days). That means that changes in the source table older than the retention period might not be retrievable.

You can check the minimum valid version of a table using the following T-SQL command:

1
SELECT CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.Employees'));

When a table is truncated, as those changes are not logged by Change Tracking, the minimum version is updated, so changes prior to the truncate cannot be retrieved.

In any of the circmstances, when the minimum valid version is bigger than the last version imported to Sidra, or when the table is truncated, the system needs to reload the entire table and replace the old data, before starting to use Change Tracking again for incremental load.

When the data extraction pipeline detects this situation, the Entity associated to the tables is marked with the flag Need Reloading in the EntityDeltaLoad table. That is, the NeedReload field in DataIngestion.EntityDeltaLoad table is set to 1.

By default, this reloading does not happen until the field EnableReload in the same table is set to 1 by the user. This is done to prevent heavy loads during working hours, which could degrade the source database performance.

There is a way to make this reloading happen automatically, without need for user intervention, and no matter the value of EnableReload.

Just add the property autoReload to the AdditionalProperties json field in DataIngestion.Entity table:

1
2
3
4
5
6
7
8
9
{
"sourceTableName":"[dbo].[mytable]",
"sourceServer":"myserver.database.windows.net",
"sourceDatabase":"mydatabase",
"sourceSchema":"dbo",
"sourceTable":"mytable", 
"consolidationMode": "Merge", 
"autoReload": "true"
}

Note that this must be done on a per Entity basis.

  • Non-Change Tracking mechanism enabled by Sidra:

Besides the SQL Server built-in Change Tracking, Sidra SQL database connector also supports an option to use an incremental query based in parametrization.

This parametrization is provided by the DeltaAttribute, which is the Attribute that is used for the incremental extraction. This Attribute acts as a unique key to identify new inserts, deletes or updates in the source database. Sidra generates a query that filters by searching the differences according to this column. This mechanism is not as efficient as the built-in Change Tracking mechanism.

This Attribute is described in the EntityDeltaLoad table in Sidra Core.

  • Non-Change Tracking custom mechanism enabled by Sidra:

If the DeltaAttribute is not present in the Entity, Sidra Core uses two additional columns, also in the EntityDeltaLoad table in Sidra Core.

These columns act as a unique key to identify new inserts, deletes or updates in the source database. Sidra generates a query that filters by searching the differences according to these columns. This mechanism is not as efficient as the built-in Change Tracking mechanism.

  • CustomExtractMaxValueExpression: This value contains the custom expression used to get the maximum value for incremental load.

  • CustomExtractDataExpression: This value contains the custom expression used to query the source table to get the incremental data. This allows to code more complex expressions than just using a DeltaAttribute.