About Sidra connectors and data intake process wizard¶
The Sidra data intake process wizard, also known as Sidra connectors, simplifies the configuration of new data intake processes in a Sidra Data Storage Unit by following a few visual steps.
When configuring and executing any new Sidra data intake process, several underlying steps are involved.
- On one hand, the necessary metadata and data governance structures are created in Sidra. This includes the creation of the data source (equivalent to a linked service in Azure Data Factory), that is specific to the engine and type of the source system (e.g., SQL Server, Azure SQL, DB2 database, etc.).
- On the other hand, the actual data integration infrastructure elements (e.g., ADF pipelines and triggers) are created, configured and deployed.
More details and concepts on what happens when configuring a new data intake process in Sidra are also described on this page.
Thanks to the Sidra data intake process wizard, or connectors, all this data intake process infrastructure is configured in less than five minutes.
After providing a few details, all the orchestration happens automatically, and the data intake process for the data source will be up and running, with all the underlying metadata and infrastructure in place.
Under this section there are different documentation pages for specific Sidra connectors.
To access the Connectors wizard in Sidra Web, go to the section Data > Connectors. Starting this process will launch the configuration of a new data intake process on a new Provider.
NOTE: currently, only users with the Admin role are allowed to access this section.
In order for data intake processes for new data sources to be created from the Web UI, the underlying data extraction and ingestion pipeline templates, as well as the code, must be packaged as a plugin in Sidra.
Plugins are an internal architecture concept in Sidra. Plugins behave as code assemblies that implement a series of interface methods for managing the installation and configuration of data extraction and ingestion elements and pipelines, as well as the creation of the associated metadata in Sidra (e.g., Provider, Entities, Attributes).
Such assemblies allow the code that creates and deploys a plugin to be downloaded, installed, and executed from the Web UI.
A connector, or data intake configuration process, is just an execution instance of a plugin where the plugin type is `connector`.
Plugins for data intake processes also package the configuration parameters, so that only a subset of the needed parameters is requested from the user.
These parameters will be the inputs to be filled in during the wizard steps.
These configuration steps involve the registration configuration (e.g., Provider), as well as the resources deployment configuration (e.g., data source parameters, metadata extraction parameters, trigger, etc.).
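As an illustration of the plugin concept, a connector plugin can be thought of as a class implementing a small installation/configuration contract: it declares which parameters the wizard must collect, creates the associated metadata, and deploys the pipelines. The class and method names below are hypothetical, a sketch of the responsibilities described above rather than Sidra's actual interface.

```python
from abc import ABC, abstractmethod

class ConnectorPlugin(ABC):
    """Hypothetical sketch of a connector-style plugin contract."""

    @abstractmethod
    def get_required_parameters(self) -> list:
        """Subset of parameters the wizard must request from the user."""

    @abstractmethod
    def create_metadata(self, params: dict) -> None:
        """Create the Provider, Entities and Attributes metadata."""

    @abstractmethod
    def deploy_pipelines(self, params: dict) -> None:
        """Create and deploy the extraction and ingestion pipelines."""

class SqlServerConnector(ConnectorPlugin):
    """Illustrative plugin for a SQL Server source."""

    def get_required_parameters(self) -> list:
        return ["connection_string", "provider_name", "trigger_schedule"]

    def create_metadata(self, params: dict) -> None:
        print(f"Registering Provider {params['provider_name']}")

    def deploy_pipelines(self, params: dict) -> None:
        print("Deploying metadata and data extraction pipelines")
```

The wizard steps then only need to collect the parameters reported by `get_required_parameters`, rather than every setting the plugin uses internally.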
The execution method of a plugin of type `connector`, which is invoked once the Confirm action is triggered after the wizard steps, transparently packages the following generic actions:
- Create the new Provider from a template.
- Create the new DataSource from a template.
- Create, deploy and execute the new metadata extraction pipeline, in order to populate the metadata for that data source.
- Create and deploy the new data extraction pipeline, in order to physically move the data from source to destination.
- Create or configure the trigger for scheduling the data extraction pipeline. You can choose to create a new trigger or use an existing trigger.
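The generic actions above can be sketched as a sequential orchestration. The function names are hypothetical placeholders for the plugin's internal steps; in the real product each step creates metadata or deploys ADF pipelines and triggers.

```python
def create_provider(params):          return "provider"            # 1. Provider from template
def create_data_source(params):       return "datasource"          # 2. DataSource from template
def run_metadata_extraction(params):  return "metadata-pipeline"   # 3. populate Entities/Attributes
def deploy_data_extraction(params):   return "data-pipeline"       # 4. data movement pipeline
def configure_trigger(params):        return "trigger"             # 5. new or existing trigger

def execute_connector(params):
    """Hypothetical sketch of the order in which a connector's
    execution method performs its generic actions."""
    return [
        create_provider(params),
        create_data_source(params),
        run_metadata_extraction(params),
        deploy_data_extraction(params),
        configure_trigger(params),
    ]
```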
For an introduction to the generic process involved in creating a new data source in Sidra, you can check the documentation page.
Each Release in Sidra will incorporate new connector plugins and versions.
This documentation will include these Sidra-owned connector plugins as soon as they are incorporated into the Sidra product.
Below are some common features implemented for the configuration of data intake processes. Some of these settings may be enabled by different UI fields in the connector plugin wizards. Some of them can also be used as specific parameters by data intake pipelines when used outside of the plugin framework. This could be the case when the plugin does not fully fit some specific requirements, and the Sidra API is used instead to configure the specific data intake process.
Type Translations and mappings¶
One important aspect when integrating data from source systems into Sidra is the type translation, or transformation, of incompatible types between the source and the destination.
Sidra incorporates a table in the Core metadata database called TypeTranslations.
When a plugin version is installed, the different type mapping and transformation rules for that source system will be populated in this table. You can see the details of this table model in the Assets metadata section.
This table contains a series of mapping and transformation rules from sources to sink systems. These sink systems are the internal Sidra destinations along the whole ingestion process.
The different rules will be loaded and used to fill certain Attribute metadata fields (HiveType, SQLType), in order to interpret how to process the fields along the ingestion process. The data extraction pipeline will then also use the type translation rules to convert to Parquet format.
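As a minimal sketch of how such translation rules can be applied, the mapping below relates a source type to the HiveType and SQLType values filled in the Attribute metadata. The specific mappings are illustrative, not the actual contents of the TypeTranslations table.

```python
# Illustrative source -> sink type mappings (not the real table contents).
TYPE_TRANSLATIONS = {
    # source_type: (HiveType, SQLType)
    "int":       ("INT",       "INT"),
    "bigint":    ("BIGINT",    "BIGINT"),
    "nvarchar":  ("STRING",    "NVARCHAR(MAX)"),
    "datetime2": ("TIMESTAMP", "DATETIME2"),
    "varbinary": ("STRING",    "VARCHAR(MAX)"),  # binary converted before Parquet
}

def translate(source_type: str):
    """Return (HiveType, SQLType) for a source type.

    Unknown types fall back to a string representation in this sketch.
    """
    return TYPE_TRANSLATIONS.get(source_type.lower(), ("STRING", "NVARCHAR(MAX)"))
```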
An example of a data extraction type transformation rule is the rule for the source type VARBINARY, whose transformation includes a specific conversion expression.
Load restrictions applied in metadata extraction¶
Section Configure new data source describes the general conceptual steps about the configuration of a new data source in Sidra. One of the required steps is to configure and create the metadata structures about the data source (Provider, Entities and Attributes).
In the case of databases, the data intake process (connectors wizard) usually incorporates the deployment and execution of a metadata extraction pipeline.
The metadata extraction pipeline reads the schema of the source database and, from that schema, creates the needed Entities and Attributes metadata in the Sidra Core metadata tables.
The metadata extraction pipeline also accepts as a parameter a list of objects to include or exclude.
This set of objects is stored in the Sidra metadata as PipelineLoadRestriction entries: sets of objects to include in or exclude from the origin data source when performing the metadata extraction process.
The list of load restriction objects can be applied with an inclusion policy (only include the objects in the LoadRestrictionObject tables) or an exclusion policy (load all objects except the objects in the LoadRestrictionObject tables).
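The inclusion/exclusion policy amounts to a simple filter over the source objects. A minimal sketch, with illustrative table names:

```python
def apply_load_restriction(objects, restriction, mode):
    """Filter source objects according to the load restriction policy.

    mode="include": keep only the objects listed in the restriction.
    mode="exclude": keep all objects except the ones listed.
    """
    restricted = set(restriction)
    if mode == "include":
        return [o for o in objects if o in restricted]
    if mode == "exclude":
        return [o for o in objects if o not in restricted]
    raise ValueError(f"unknown mode: {mode}")

# Illustrative usage: skip a staging table during metadata extraction.
tables = ["dbo.Orders", "dbo.Staging", "dbo.Customers"]
kept = apply_load_restriction(tables, ["dbo.Staging"], mode="exclude")
```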
Common sections configuration steps¶
The majority of plugins, if not all, have some common configuration steps. This is, for example, the case of the configuration of the Provider.
When creating a new Provider, the user can configure some metadata for storing this Provider in Sidra. The name of the Provider is mandatory.
These are the fields to be configured for a Provider on a connector plugin wizard:
- Name: name representing the data origin that needs to be configured. This name can only contain alphanumeric characters.
- Owner: business owner responsible for the creation of the Provider.
- Short description: short description about the Provider.
- Data Storage Unit (DSU): the DSU where the Provider will be set.
- Details: markdown text with additional details on the Provider.
For more details on Sidra Metadata model, please check the documentation.
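Since the Provider name is mandatory and may only contain alphanumeric characters, the wizard's form validation can be sketched as follows. The field names and the DSU check are illustrative; only the name rules come from the documentation above.

```python
import re

def validate_provider(fields: dict) -> list:
    """Return a list of validation errors for the Provider form.

    Sketch only: field keys are hypothetical, but the name rules
    (mandatory, alphanumeric-only) mirror the documented constraints.
    """
    errors = []
    name = fields.get("name", "")
    if not name:
        errors.append("Name is mandatory")
    elif not re.fullmatch(r"[A-Za-z0-9]+", name):
        errors.append("Name can only contain alphanumeric characters")
    if not fields.get("dsu"):
        errors.append("A Data Storage Unit must be selected")
    return errors
```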
Another common infrastructure piece to all connectors is the Trigger.
Connectors usually need to set up a scheduled trigger to execute the data extraction pipeline.
Users can choose to create a new trigger in Sidra to schedule the connector intake, or re-use an existing trigger. The Sidra connectors wizard allows creating scheduled triggers. When setting up a new scheduled trigger, users will need to provide some details, which are explained in the Data Factory documentation:
- Start time: day and hour when the trigger will be executed for the first time.
- End time: optional parameter. If not specified, the trigger will not have an expiration date.
- Frequency: units defined for the interval of executions of the trigger (minute, hour, day, week, month).
- Interval: numeric value indicating how often to fire the scheduled trigger. The unit of measure for this value is given in the Frequency field.
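The four fields above define a recurrence. The sketch below computes the first few firing times of such a schedule under simplified assumptions: fixed-length intervals, with a month approximated as 30 days (ADF evaluates real calendar boundaries).

```python
from datetime import datetime, timedelta

# Approximate interval lengths; a month is treated as 30 days
# for illustration only.
FREQUENCY_UNITS = {
    "minute": timedelta(minutes=1),
    "hour":   timedelta(hours=1),
    "day":    timedelta(days=1),
    "week":   timedelta(weeks=1),
    "month":  timedelta(days=30),
}

def next_executions(start, frequency, interval, count, end=None):
    """Return up to `count` firing times of a scheduled trigger.

    `start`/`end` map to the Start/End time fields; `frequency` and
    `interval` map to the Frequency and Interval fields.
    """
    step = FREQUENCY_UNITS[frequency] * interval
    fires, fire = [], start
    for _ in range(count):
        if end is not None and fire > end:
            break  # trigger has expired
        fires.append(fire)
        fire += step
    return fires
```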
A note on Integration runtime¶
In order to connect with some origin servers, an Integration Runtime machine must be installed. This is the case for many source systems on-premises or inside a VNET. The Integration Runtime can be installed on an on-premises machine or on an Azure VM, as long as it can serve as a link between the DSU and the origin data server.

It is recommended to perform the installation with a script that can be provided by the Sidra team. This script automates the installation of the JRE and the Visual C++ redistributable, which are needed to perform the data ingestion into the Data Lake. Additionally, for the installation of the above-mentioned modules, it may be necessary to add JAVA_HOME to the environment variables. To verify the correct installation, the following command can be executed: `java -version`.
How to test the connection and confirm¶
Once all the steps have been completed for the configuration of the data source, a screen shows a summary of all the configured parameters.
An optional Test Connection button performs a connection against the origin and may also validate other input parameters. If the connection is successful, a success message is displayed; otherwise, an error modal appears with the details.
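As a minimal sketch of what a connection test does at the network level, the check below verifies TCP reachability of the source server. The real Test Connection also validates credentials and other parameters; the host and port used here are illustrative.

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Network-level reachability check for the origin server.

    Sketch only: a real connection test would also authenticate
    and validate the remaining input parameters.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```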
Once all the input parameters are complete and the user clicks the Confirm button, a toast message is displayed in Sidra Web, indicating that the creation and configuration process has started. After some minutes, once all the metadata has been registered in the Sidra database and all the infrastructure has been created (ADF pipelines), a notification will be received.
The data ingestion will be performed by the data extraction pipeline that the plugin execution deploys.
How to check pipelines in Data Factory¶
There is usually an option in the connectors wizard to force the automatic execution of the data extraction pipeline right after the creation of the pipelines. In this case, ADF will show a pipeline run for that pipeline. Pipeline runs can be seen in the ADF Monitor section, where a filter allows searching for the pipeline and obtaining its executions.
If the metadata extraction pipeline needs to be re-executed, go to the pipeline definition (Author) and launch it manually: click Add trigger > Trigger now. A window will appear to pass the ItemID of the Provider as a parameter. This ItemID of the Provider is obtained from the pipeline template.
If the data extraction pipeline needs to be manually executed or re-executed, follow the same steps (Author, then Add trigger > Trigger now); in this case, the parameter to pass is the executionDate (just a date).