About Sidra plugins and Data Intake Processes¶
This page describes some concepts related to Data Intake Processes in Sidra and how to configure them with the help of Sidra connector plugins.
A Data Intake Process in Sidra is an abstraction that groups the set of configurations for ingesting data from a given data source (e.g., a SQL Server database, a SharePoint library, etc.), as well as all the related data extraction infrastructure generated for the data intake (e.g., metadata, trigger, data extraction pipelines).
A Sidra plugin, on the other hand, is an internal architectural concept in Sidra: an assembly of code that is installed and executed to connect to a source system. When a plugin encapsulates code to configure a Data Intake Process, it is also referred to as a connector plugin, or a plugin of type connector.
Before explaining in more depth what a Sidra plugin is, let's review some concepts regarding the configuration of data intake in Sidra:
When configuring and executing a new data intake for a new or existing data source system, several underlying steps are involved:
- On one hand, the necessary metadata and data governance structures are created in Sidra. This includes the creation of the data source that is specific to the engine and type of the source system (e.g., SQL Server, Azure SQL, DB2 database, etc.). This also includes the generation of the Asset metadata.
- On the other hand, the actual data integration infrastructure elements (e.g., ADF pipelines and triggers) are created, configured and deployed.
Sidra API provides different methods for creating the metadata structures needed to configure a new Data Intake Process: creating the Provider, creating the Data Source, and creating and deploying the pipelines. However, the latest versions of Sidra include a new mechanism in Sidra Web that allows configuring data intake for a set of source systems. Users can choose from a gallery of available connector plugins to configure a Data Intake Process. The plugin is transparently installed in Sidra Core, and a wizard is displayed to the user. Each wizard contains a set of configuration parameters, which the user enters to submit (confirm) the creation of the respective Data Intake Process.
After providing the configuration parameters in the respective wizard form steps, such as connection string parameters or trigger selection, all the orchestration to create the underlying Data Intake Process happens automatically. Thanks to the Sidra Data Intake Process wizards, this infrastructure can be configured in less than five minutes. Once the Data Intake Process for the data source is up and running, all the underlying metadata (Sidra Asset metadata) and infrastructure (Azure Data Factory objects) will be in place.
You can see the metadata table for the Data Intake Process objects in the Assets Metadata page.
Configure a new data intake
More details on what happens when configuring a new data intake in Sidra are described in detail on this page.
This documentation section contains pages for the specific Sidra connector plugins released with Sidra.
Sidra Web Management UI includes a list of all Data Intake Processes configured in the respective installation environment. To access the Data Intake Process list in Sidra Web, go to the Data Intake section. A list will display the different Data Intake Processes configured in the system, along with a button to add a new Data Intake Process.
In future releases after 2022.R1, new actions will be made available from this table, such as viewing the details of a Data Intake Process, updating its configuration, or upgrading it to the newest released version (a new released version of the underlying plugin).
Key Aspects to configure a new Data Intake Process¶
What distinguishes one Data Intake Process from another depends on the actual underlying data intake configuration. For example:
Data intake from different data source engines (e.g., different database engines like SQL Server and Oracle) needs to be created as different Data Intake Processes, because they use different plugins to connect to and configure the data source.
Data intake from the same data source engine (e.g., DB2), but different databases, needs to be created as different Data Intake Processes, as they use the same plugin but different parameters to connect to the data source, such as the connection string.
Data intake from the same data source system could be defined as separate Data Intake Processes. This is useful in the following example scenarios:
You need to create separate Data Intake Processes to load distinct sets of tables from the same database, whose extraction is scheduled via the same Trigger in ADF.
You need to create separate Data Intake Processes to load distinct sets of tables from the same database, associating a different trigger or schedule for the data extraction of each set of tables. In this case, one Data Intake Process would use one trigger, and the other Data Intake Process would use a different trigger.
Data Intake Process migrations and limitations¶
The Data Intake Process lifecycle is a major feature that is being developed and released across different Sidra versions.
In Sidra version 2022.R1 (1.11.x), the Data Intake Process has some limitations that will be covered in future versions of Sidra:
For all configured pipelines created before the release of the Data Intake Process (version 1.11.x):
- There is an automated migration process being applied on every installation environment together with the Sidra update process.
- This migration process creates the underlying Data Intake Process objects in the Sidra Core metadata database, even if the configured data intake pipelines were not created via a connector plugin, or were created by a connector plugin that does NOT support this Data Intake Process concept.
- After this migration, users can expect to see the data intake configured in the environment reflected in the Data Intake Process list.
- This does not, however, affect in any way the normal functioning of the underlying data extraction pipelines, which will continue working with no changes. Users are not expected to perform any changes to their currently working pipelines.
- The migration to create these Data Intake Processes will therefore only be available for the list of supported connector plugins in Sidra.
It is important to note that for existing pipelines without Sidra plugin support, such as customer-specific pipelines, there will be no such migration.
This means that even if some data intake pipelines are configured, there will NOT be an associated Data Intake Process visible in the Data Intake Process list in Sidra Web UI.
This does not affect the normal functioning of the underlying data intake (pipelines execution and loading of data in Sidra). The created pipelines will NOT be interrupted and will continue working normally.
Currently, it is only possible to create a Data Intake Process associated with a new Provider; creating one associated with an existing Provider is not yet possible.
Currently, only users with the Admin role are allowed to access this section.
New Data Intake Processes created with existing Sidra plugins after Release 2022.R1 (1.11.x) will register the Data Intake Process in the Sidra Core metadata database automatically at plugin execution time.
Plugin approach for Data Intake Processes¶
For Data Intake Processes to be created from the Web UI, it is required that the underlying data extraction and ingestion pipeline templates as well as code are packaged as a plugin in Sidra.
Plugins are an internal architecture concept in Sidra. Plugins behave as code assemblies that implement a series of interface methods for managing the installation and configuration of data extraction and ingestion elements and pipelines, as well as the creation of the associated metadata in Sidra (e.g., Provider, Entities, Attributes). Such assemblies allow the plugin code to be downloaded, installed, and executed from the Web UI.
Sidra incorporates several types of plugins. Plugins that create Data Intake Processes have plugin type `connector`.
A plugin of type connector (also called a plugin for Data Intake Processes) also packages the configuration parameters, so that only a subset of the needed parameters is retrieved from the user.
These parameters will be the inputs to be filled in during the wizard steps.
Steps involved in the creation of a Data Intake Process via a plugin¶
The configuration steps needed to create a Data Intake Process involve the registration configuration (e.g., Provider), as well as the ADF resources deployment configuration (e.g., data source parameters, metadata extraction parameters, trigger, etc.).
Step 1. Choose a plugin to create the Data Intake Process¶
In Sidra Web UI, go to the page `Data Intake > Data Intake Processes`. Note that this page is only available from Sidra version 2022.R1.
Click on the button to add a new Data Intake Process:
- You will see a gallery with available plugins to configure the new Data Intake Process. This list includes the latest versions of all plugins that are compatible with the Sidra version in the corresponding installation.
- If the plugin has not been installed previously in the Sidra installation, a plugin installation process begins in the background. After the installation finishes, you will see a wizard with the configuration steps. Note that the first-time installation of a plugin may take a few seconds; please stay on the page to allow the installation to finish.
Step 2. Enter the configuration parameters to create the Data Intake Process¶
Once the wizard with the different configuration steps is loaded, you can start entering the relevant configuration parameters for creating the Data Intake Process and its underlying infrastructure.
The configuration steps are logically grouped in two sets:
- Configuration steps common to all plugins (e.g., Configure Provider, Configure Trigger).
- Configuration steps specific to each plugin (e.g., data source or metadata configuration).
Common configuration steps¶
Sidra plugins of type connector have some common configuration steps shared across all plugins.
The first step in a Data Intake Process is the actual configuration of the name and description of the Data Intake Process object:
This step is only available from Sidra version 2022.R1.
Name: This is the name that will appear in the list of Data Intake Processes in the Sidra Web UI. As such, it can be a friendly name to self-describe the process, e.g., CRM Database nightly load.
Description: This is an optional description about the Data Intake Process.
Another common configuration step is the configuration of the Provider. When creating a new Provider, the user can configure some metadata for storing this Provider in Sidra. The name of the Provider is mandatory.
These are the fields to be configured for a Provider on a connector plugin wizard:
Name: name representing the data origin that needs to be configured. This name can only contain alphanumeric characters.
Owner: business owner responsible for the creation of the Provider.
Short description: short description about the Provider.
Data Storage Unit (DSU): the DSU where the Provider will be set.
Details: markdown text with additional details on the Provider.
For more details on Sidra Metadata model, please check the documentation.
Another infrastructure piece common to all plugins of type connector is the Trigger. Connector plugins usually need to set up a scheduled trigger to execute the data extraction pipeline.
Users can choose to create a new trigger in Sidra to schedule this Data Intake Process, or re-use an existing trigger. Sidra connector wizards allow creating scheduled triggers. When setting up a new scheduled trigger, users need to provide some details, which are explained in the Data Factory documentation:
Start time: day and hour when the trigger will be executed for the first time.
End time: optional parameter. If not specified, the trigger will not have an expiration date.
Frequency: units defined for the interval of executions of the trigger (minute, hour, day, week, month).
Interval: a numeric value indicating how often to fire the scheduled trigger. The unit of measure for this value is given in the Frequency field.
The Trigger configuration step also offers an option to run a first execution of the data extraction pipeline just after the Data Intake Process is created. If this setting is set to True, then, regardless of the trigger, the data extraction pipeline will automatically run for the first time just after the Data Intake Process artifacts are created. Subsequent executions of the data extraction pipeline will follow the configured trigger settings.
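As an illustration, the trigger details collected in this step map naturally onto an ADF schedule trigger definition. The sketch below (in Python, with hypothetical names and values, not Sidra's actual payload) shows how these parameters fit together:

```python
# Hypothetical sketch of the schedule settings collected by the wizard,
# shaped after an ADF ScheduleTrigger recurrence block.
trigger_config = {
    "name": "crm-nightly-trigger",            # hypothetical trigger name
    "recurrence": {
        "frequency": "Day",                   # Minute, Hour, Day, Week, Month
        "interval": 1,                        # fire every 1 unit of 'frequency'
        "startTime": "2022-06-01T02:00:00Z",  # first execution (UTC)
        "endTime": None,                      # no expiration date if omitted
    },
    # Option to run the data extraction pipeline once right after creation,
    # regardless of the trigger schedule.
    "runFirstExecutionNow": True,
}

def describe(cfg):
    """Render a human-readable summary of the trigger schedule."""
    rec = cfg["recurrence"]
    return f"every {rec['interval']} {rec['frequency'].lower()}(s) from {rec['startTime']}"

print(describe(trigger_config))  # every 1 day(s) from 2022-06-01T02:00:00Z
```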
A note on Integration Runtime
To connect to some origin servers, an Integration Runtime machine must be installed. This is the case for many source systems on-premises or inside a VNet.
The Integration Runtime can be installed on an on-premises machine or on an Azure VM, as long as it can serve as a link between the DSU and the origin data server.
It is recommended to perform the installation following a script that can be provided by the Sidra team. This script automates the installation of the JRE and the Visual C++ redistributable, which are needed to perform the data ingestion into the Data Lake.
Additionally, for the installation of the above-mentioned modules, you may need to add `JAVA_HOME` to the environment variables.
To verify the correct installation, the following command can be executed:
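The exact verification command is not reproduced here. As a minimal sketch of what such a check covers (the function name is hypothetical and not part of any Sidra tooling), the `JAVA_HOME` prerequisite could be verified like this:

```python
import os

def java_prerequisites_ok(env):
    """Return a list of missing prerequisites for the Integration Runtime
    ingestion modules. A sketch only: the real installation script provided
    by the Sidra team performs its own checks."""
    missing = []
    if not env.get("JAVA_HOME"):
        missing.append("JAVA_HOME environment variable")
    return missing

# Example: check the environment of the current process.
print(java_prerequisites_ok(dict(os.environ)))
```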
Step 3. Validate the configuration and confirm¶
After the mandatory configuration parameters have been entered, you will see a summary screen with the entered configuration details.
You can also optionally click on the `Validate` button to validate the configuration parameters. Some validations take place in this step, such as:
- Check that the Provider name does not already exist.
- Check that the data source parameters are correct and establish a test connection.
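These validations can be sketched as follows (a minimal illustration; function and parameter names are hypothetical, not Sidra's API):

```python
def validate_configuration(provider_name, existing_providers, test_connection):
    """Collect validation errors; an empty list means validation passed."""
    errors = []
    # 1. The Provider name must not already exist in the installation.
    if provider_name in existing_providers:
        errors.append(f"Provider '{provider_name}' already exists")
    # 2. The data source parameters must allow a test connection.
    if not test_connection():
        errors.append("Could not establish a test connection to the data source")
    return errors

# Usage: a stubbed connection test that always succeeds.
print(validate_configuration("CrmDatabase", {"ErpDatabase"}, lambda: True))  # []
```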
If the result of the validation is successful, you will be able to click on the `Confirm` button. The execution method of the plugin of type connector is invoked once the confirmation action is triggered. This action submits the creation of the Data Intake Process and the related metadata and infrastructure.
Once clicked, a toast message will be displayed in Sidra Web, indicating that the creation and configuration process has started. After some minutes, once all the metadata has been registered in the Sidra DB and all the infrastructure has been created (ADF pipelines), a notification will be received.
Background processes executed¶
The execution method of the plugin internally packages the following generic actions as background processes:
- Create a new Data Intake Process record in Sidra Metadata DB.
- Create the new Provider from a template and relate it to the Data Intake Process.
- Create the new DataSource from a template.
- Create, deploy, and execute the new metadata extraction pipeline to populate the metadata for that data source, whenever it is applicable for the respective plugin.
- Create and deploy the new data extraction pipeline to physically move the data from source to destination.
- Create or configure the trigger for scheduling the data extraction pipeline. You can choose to create a new trigger or use an existing trigger.
Once all the above metadata and infrastructure are created, the data ingestion will be performed by the data extraction pipeline that the plugin execution deploys.
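The background steps above can be summarized in a short sketch (purely illustrative; the real logic lives inside the plugin assembly and ADF):

```python
def execute_connector_plugin(config):
    """Return a log of the generic background actions run by a connector
    plugin's execution method (illustrative step names)."""
    steps = [
        "Create Data Intake Process record in Sidra Metadata DB",
        "Create Provider from template and link it to the Data Intake Process",
        "Create DataSource from template",
        "Create, deploy and run the metadata extraction pipeline (if applicable)",
        "Create and deploy the data extraction pipeline",
        "Create or reuse the trigger scheduling the data extraction pipeline",
    ]
    return [f"[{config['name']}] {step}" for step in steps]

for line in execute_connector_plugin({"name": "CRM Database nightly load"}):
    print(line)
```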
- For an introduction to the conceptual process involved in creating a new data intake in Sidra, you can check the documentation page.
- Each Release in Sidra will incorporate new connector plugins and versions.
- This documentation will include these Sidra-owned connector plugins as soon as they are incorporated into the Sidra product.
Common concepts used in Data Intake Process configuration¶
Type Translations and mappings¶
One important aspect when integrating data from source systems into Sidra is the type translation, i.e., the transformation of incompatible types between the source and the destination.
Sidra incorporates a table in the Core metadata database called TypeTranslations.
When a plugin version is installed, the different type mapping and transformation rules for that source system will be populated in this table.
You can see the details of this table model in the Assets metadata section.
This table contains a series of mapping and transformation rules from sources to sink systems. These sink systems are the internal Sidra destinations along the whole ingestion process.
The different rules will be loaded and used to fill certain Attribute metadata fields (HiveType, SQLType), to interpret how to process the fields along the ingestion process.
The data extraction pipeline will then also use the type translation rules to convert to Parquet format.
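As an illustration of how such a lookup behaves, the sketch below maps a source SQL type to the (SQLType, HiveType) pair stored on the Attribute. The mappings shown are examples only, not Sidra's actual TypeTranslations rules:

```python
# Illustrative type-translation lookup, in the spirit of the TypeTranslations
# table. These mappings are examples, not Sidra's actual rules.
TYPE_TRANSLATIONS = {
    # source type -> (SQLType, HiveType) recorded on the Attribute
    "VARCHAR":   ("NVARCHAR", "STRING"),
    "DATETIME":  ("DATETIME2", "TIMESTAMP"),
    "VARBINARY": ("VARBINARY", "BINARY"),
}

def translate(source_type):
    """Resolve sink types for a source column type; default to STRING."""
    return TYPE_TRANSLATIONS.get(source_type.upper(), (source_type, "STRING"))

print(translate("varbinary"))  # ('VARBINARY', 'BINARY')
```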
An example of a data extraction type transformation rule is the one for the source type VARBINARY. In this case, the transformation will include the following expression:
Load restrictions applied in metadata extraction¶
Section Configure new data source describes the general conceptual steps about the configuration of a new data source in Sidra. One of the required steps is to configure and create the metadata structures about the data source (Provider, Entities and Attributes).
In the case of databases, the Data Intake Process wizard usually incorporates the deployment and execution of a metadata extraction pipeline.
The metadata extraction pipeline reads into the schema of the source databases and creates the needed Entities and Attributes metadata in Sidra Core metadata tables.
The information about Entities and Attributes is obtained from that schema.
The metadata extraction pipeline also includes as a parameter a list of objects to include or exclude.
The set of objects is stored in Sidra metadata tables. `PipelineLoadRestriction` entries are sets of objects to include or exclude from the origin data source when performing the metadata extraction process.
The list of load restriction objects can be applied with an inclusion policy (only include the objects in the LoadRestrictionObject tables) or an exclusion policy (load all objects except the objects in the LoadRestrictionObject tables).
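The inclusion/exclusion policy can be sketched as a simple filter (a minimal illustration; names are hypothetical, not Sidra's implementation):

```python
def apply_load_restriction(source_objects, restriction_objects, mode):
    """Filter source objects per the load restriction policy.

    mode = "inclusion": keep only the listed objects.
    mode = "exclusion": keep everything except the listed objects.
    """
    restricted = set(restriction_objects)
    if mode == "inclusion":
        return [obj for obj in source_objects if obj in restricted]
    if mode == "exclusion":
        return [obj for obj in source_objects if obj not in restricted]
    raise ValueError(f"Unknown restriction mode: {mode}")

tables = ["dbo.Customers", "dbo.Orders", "dbo.AuditLog"]
print(apply_load_restriction(tables, ["dbo.AuditLog"], "exclusion"))
# ['dbo.Customers', 'dbo.Orders']
```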
How to check pipelines in Data Factory¶
There is usually an option in the Data Intake Process wizard to force the automatic execution of the data extraction pipeline right after the creation of the pipelines. In this case, a pipeline execution for that pipeline will be visible in ADF. Pipeline executions can be seen in the Monitor section of ADF, where a filter allows searching for the pipeline and obtaining its executions.
If the metadata extraction pipeline needs to be re-executed, go to the pipeline definition (Author) and launch the trigger: click on `Add trigger > Trigger now`. A window will appear to pass as a parameter the `ItemID` of the Provider. This `ItemID` of the Provider is obtained from the pipeline template.
If the data extraction pipeline needs to be manually executed or re-executed, go to the pipeline definition (Author) and launch the trigger: click on `Add trigger > Trigger now`. A window will appear to pass as a parameter the `executionDate` (just a date).
Schema Evolution for database plugins in Sidra¶
Note that this section applies only from version 1.12 onwards.
Database plugins in Sidra support the automatic evolution of the source table schema whenever new columns are added to the source tables. As a result, if the setting to include all tables and objects is enabled, new Attributes will be created in Sidra Core metadata for any new columns added to the source tables.
To activate this option, you need to enable an optional setting in the Metadata Extractor options, which is disabled by default.
If set to `Yes`, the metadata extraction pipeline that is created as part of this Data Intake Process creation will always run before each execution of the data extraction pipeline. As a result, any new columns in the source database will be detected and incorporated as new Attributes in the respective Sidra Entity.
To achieve this, every time a data intake extraction is triggered (via the associated Trigger), an orchestrator ADF pipeline is run first. This orchestrator ADF pipeline does the following:
Check if the setting `Refresh Schema Automatically` is enabled:
- If set to `True`, when the Trigger fires, the deployed metadata extraction pipeline for that plugin and plugin version will be executed. This includes the queries to check the schema of the source system, the creation of Entities and Attributes, and the generation of the Transfer Query.
- If set to `False`, when the Trigger fires, just the deployed data extraction pipeline for that plugin and plugin version will be executed.
Notifications for the success or failure of either the metadata extraction pipeline or the data extraction pipeline will be sent, so you can see the overall configuration success status. Please see below a depiction of this orchestrator pipeline in ADF:
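The branching performed by this orchestrator can be sketched as follows (the real orchestrator is an ADF pipeline; pipeline names here are placeholders):

```python
def orchestrate(refresh_schema_automatically, run_pipeline):
    """When the trigger fires, optionally refresh the schema before
    extracting data. Returns the names of the pipelines executed."""
    executed = []
    if refresh_schema_automatically:
        # Re-read the source schema, create any new Entities/Attributes,
        # and regenerate the Transfer Query before the data extraction.
        executed.append(run_pipeline("metadata-extraction"))
    executed.append(run_pipeline("data-extraction"))
    return executed

print(orchestrate(True, lambda name: name))
# ['metadata-extraction', 'data-extraction']
```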
In each execution of the metadata extraction pipeline, the Sidra API endpoint for metadata inference will be called for each Entity. This will check for any new Attributes added at the source.
The explanation above assumes that the source table is included to be loaded as per the definition of the Object Restriction list. The table will be included if:
- The setting is `include all tables`.
- The setting is `include some tables` and the table is in the list.
- The setting is `exclude some tables` and the table is not in the list.
Apart from creating these new Attributes for added columns in the Sidra Core metadata, new columns will also be created in the respective Databricks tables for that Entity. For this, the Create Tables and Transfer Query scripts for the respective Entities will be re-created.
In each execution of the data extraction pipeline, Assets will be created that include the new columns in Databricks.
For Release 2022.R2 the following plugins will support this feature:
- Azure SQL
- SQL Server
This feature is not supported for the following plugins:
- Sharepoint Online List
Removal of columns from source systems¶
If columns are removed from the source system, there will be no automated removal of Attributes in Sidra or of columns from the Databricks tables. This scenario is only supported manually. Please contact Support for this manual intervention.