About the MySQL connector plugin¶
The MySQL Connector (a Sidra plugin of type connector) enables seamless integration with MySQL enterprise relational database.
You can see more details of what is a Data Intake Process in this page.
Sidra's connector for MySQL allows to configure a Data Intake Process from a MySQL data source in a fast and easy way. The resulting configured Data Intake Process extracts data from any table and view in the source database and loads it into the specified Data Storage Unit at regular intervals. It relies on the Sidra Metadata Model for mapping source data structures to Sidra as destination and uses Azure Data Factory as underlying data integration mechanism within Sidra.
When configuring and executing this plugin of type connector, several underlying steps are involved to achieve the following:
- The necessary metadata and data governance structures are created and populated in Sidra: Data Intake Process, Provider, Entities, Attributes and needed relations among these.
- The actual data integration infrastructure (ADF Pipeline) is created, configured, and deployed.
The resulting Data Intake Process is configured in less than five-minutes. Once the settings are configured and the deployment process is started, the actual duration of the data ingestion from the created pipelines may vary from few minutes to few hours, depending on the data volumes.
After starting the Data Intake Process creation, users will receive a message that the process has started and will continue in the background. Users will be able to navigate through Sidra Web as usual while this process happens.
Once the whole deployment process is finished, users will receive a notification in Sidra Web Notifications widget. If this process went successfully, the new data structures (new Entities) will appear in the Data Catalog automatically, and the Data Intake Process will appear in the Data Intake section including this new data source.
Supported MySQL server versions¶
The supported MySQL server version by this connector plugin is:
- MySQL 8.0.
Supported data synchronization mechanisms¶
The MySQL connector supports different modes of data synchronization:
Full load data synchronization: Generally performed for first time loads. This is also the default mode if no alternative incremental load mechanism is defined. By default, the first load will be a complete load.
Incremental load data synchronization: This data synchronization mechanism captures updates for any new or modified data from the source database. Only data corresponding to the updates since the last synchronization is retrieved.
For incremental load to work, there must be a defined mechanism to capture updates. For the first version of the MySQL plugin, incremental load is supported by the Sidra built-in syncronization method. This method relies on specific configurations in the Sidra Metadata EntityDeltaLoad table. In such table different possibilities are allowed to define the Attribute or Attributes to track the changes, or even the custom expression to calculate such changes.
More information about these load mechanisms is included in further sections.
The process of setting up a MySQL Data Intake Process involves several key actions. The different sections of the input form are organized into these main steps.
Step 1. Configure Data Intake Process¶
Please see on the common Sidra connector plugins page about the parameters needed to configure a Data Intake Process.
Step 2. Configure Provider¶
Please see on the common Sidra connector plugins page about the parameters needed to create a new Provider.
Step 3. Configure Data Source¶
The data source represents the connection to the source database. A Data Source abstracts the details of creating a Linked Service in Azure Data Factory.
Some key aspects to take into account in this section are:
The fields required are the Integration Runtime and the connection string to the database.
Sidra MySQL connector plugin will register this new data source in Sidra Metadata and deploy a Linked Service in Azure Data factory with this connection.
The created Linked Service will use Key Vault in Sidra Core to store the connection string to the database.
Default value for Integration Runtime is
For more details on Linked Services check the Data Factory documentation.
The information that needs to be provided is the following:
Integration Runtime: this is the ADF integration runtime in charge of executing the connection between the data origin, and is associated with a deployed Linked Service. In case no IR is required, the option Default will be selected.
Username: user with access to the database.
Password: user password.
Step 4. Configure Metadata Extraction¶
Sidra MySQL connector plugin replicates the schema and tables from the source database, by querying the
INFORMATION_SCHEMA.TABLES System View tables.
The extracted schema is replicated to the Sidra Metadata model hierarchy. Tables in the database source system are replicated in Sidra as Entities, and columns of these tables, as Attributes.
The Entity table in Sidra metadata model contains data about the format of the Entity generated, as well as information about how this Entity should be handled by the system. Attributes contain data about the columns of the tables. Sidra adds additional system-level Atributes to convey system-handling information at the level of the Entity (or table).
As part of the metadata extraction of a Data Intake Process setup, some coding structures are also auto generated, such as the transfer query scripts that transform the data in raw storage into the optimized format in the Data Lake.
For more details, please check the Data Ingestion documentation.
The metadata extraction represents in which way the origin will be consumed in order to obtain its metadata. The information fields required to fill in the Metadata extraction section will be used for three main purposes internally in Sidra:
- Create and populate the Sidra Core metadata structures (Entities and Attributes).
- Create, deploy and execute an Azure Data Factory pipeline for the metadata extraction steps (see below steps).
- Auto-generate the create tables and transfer query scripts for each Entity, including all the steps for storing the optimized data in the data lake storage.
These are the metadata extractor configuration parameters to be input for this version of connector plugin. All of these parameters are optional:
Number of Tables per Batch: this is used internally in the metadata extractor pipeline in order to specify the number of tables that will be consumed per batch.
Types of Objects to Load: views, tables or both.
Object Restriction Mode: there can be three options here:
Include all objects: all objects from all databases will be imported (by default).
Include some objects: to specify a list of objects to include.
Exclude some objects: to specify a list of objects to exclude.
Object Restriction List: this is the list of objects (either Tables or Views) in the source database to include or exclude from the metadata extraction, depending on the
Object Restriction Modeconfigured. It corresponds to
TABLE_NAMEfields in the
INFORMATION_SCHEMA.TABLES. If no values are specified in this field, all objects will be loaded. To fill in this field, add a list of [
].[ ] or [ ] elements in plain text, separated by comma. Examples: [classicmodels].[customers], [classicmodels].
Once the metadata extractor pipeline has been created, it is executed. This execution could take a while (from 5 to 15 minutes) depending on the load of Data Factory and the status and load of the associated Databricks cluster. Sidra exposes API endpoints as well to manually execute the SQL Metadata extractor pipeline.
Step 5. Configure Trigger¶
Please see on the common Sidra connector plugins page about the parameters needed to set up a trigger.
MySQL data types and transformations¶
Some of the original MySQL data types are not supported by Azure Data Factory conversions to Parquet (the format of storage in Sidra DSU).
For those data types that are not supported, Sidra connectors as plugins incorporate some type translation mechanisms in place, used both at metadata extraction and at data extraction phases.
If a certain data type is not supported, that data type is automatically changed to the closest supported type as defined in a Type Translations table.
You can find more information about the general process for type translations for plugins is in this page.
Data Extraction pipeline¶
Once with all the information provided in the Configuration steps section, Sidra Core will create and deploy the actual data extractor pipeline. The data extraction pipeline is where the actual movement and transformation of data happens:
- On one hand, the
copy dataADF activities are executed, which actually move the data between the source (MySQL Database) and the destination (Azure Data Lake Gen2).
- On the other hand, the transfer query scripts are executed for each Entity in order to perform data optimization, data validation, etc and loading the data in its optimized format in the data lake.
The time to execute this pipeline is variable depending on the volumne of data and the environment.
Initial full data synchronization¶
Once Sidra is connected to the source database, Sidra MySQL Database connector plugin first copies all rows from every table in every schema and table that has not been explicitly set to be excluded.
For each table (Entity), rows are copied by performing a SELECT statement. Copy Activity in Azure Data Factory parallelizes reads and writes according to the source and destination.
Loading incremental data mechanisms¶
Once an initial synchronization is complete, Sidra performs incremental synchronizations on the new and modified data in the source system.
Sidra Connector plugin for MySQL Database uses the following mechanism for incremental updates: Non-Change Tracking mechanism enabled by Sidra and Non-Change Tracking custom mechanism enabled by Sidra.
Sidra MySQL database connector plugin supports an option to use an incremental query based in parametrization.
This parametrization is provided by the
DeltaAttribute, which is the Attribute that is used for the incremental extraction.
This Attribute acts as a unique key to identify new inserts, deletes or updates in the source database. Sidra generates a query that filters
by searching the differences according to this column.
When rows are deleted in the source database, Sidra does not delete rows from the destination storage. A logical delete happens in Sidra instead. Logical deletion is supported in Sidra through a metadata field called
DeltaAttribute is not present in the Entity, Sidra Core uses two additional columns, also in the EntityDeltaLoad table in Sidra Core.
These columns act as a unique key to identify new inserts, deletes or updates in the source database. Sidra generates a query that filters by searching the differences according to these columns.
CustomExtractMaxValueExpression: This value contains the custom expression used to get the maximum value for incremental load.
CustomExtractDataExpression: This value contains the custom expression used to query the source table to get the incremental data. This allows to code more complex expressions than just using a
This Attribute is described in the EntityDeltaLoad table in Sidra Core.