Data Intake Process configuration via connectors¶

This page gives more detail on the configuration steps for a Data Intake Process (DIP) created through connectors in Sidra.

Key Aspects¶

What distinguishes one Data Intake Process from another depends on the underlying data intake configuration. For example:

- Data intake from different data source engines (e.g., different database engines such as SQL Server and Oracle) needs to be created as different Data Intake Processes, because each engine uses a different connector to connect to and configure the data source.
- Data intake from the same data source engine (e.g., DB2) but from different databases needs to be created as different Data Intake Processes: they use the same connector, but with different connection parameters, such as the connection string.
- Data intake from the same data source system can also be defined as separate Data Intake Processes. This is useful in scenarios such as the following:
  - You need separate Data Intake Processes to load distinct sets of tables from the same database, with their extraction scheduled via the same Trigger in ADF.
  - You need separate Data Intake Processes to load distinct sets of tables from the same database, associating a different trigger or schedule with each set of tables. In this case, one Data Intake Process uses one trigger, and the other Data Intake Process uses a different trigger.
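
The rules above can be read as a grouping problem: tables that differ in engine, database, or schedule cannot share a Data Intake Process. The following is a minimal sketch of that reasoning only, not part of Sidra. The `TableSource` type, the `plan_intake_processes` function, and the sample names are all hypothetical. The result is the minimum split; as noted above, you may still choose to split a group further.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSource:
    """Hypothetical description of one table to be loaded."""
    engine: str      # data source engine, e.g. "SqlServer", "Oracle", "DB2"
    database: str    # label for the database (stands in for the connection string)
    table: str
    schedule: str    # label for the trigger/schedule the table should follow


def plan_intake_processes(sources):
    """Group tables so that each group shares engine, database, and schedule."""
    groups = defaultdict(list)
    for source in sources:
        groups[(source.engine, source.database, source.schedule)].append(source.table)
    return dict(groups)


sources = [
    TableSource("SqlServer", "crm-db", "Customers", "nightly"),
    TableSource("SqlServer", "crm-db", "Orders", "nightly"),
    TableSource("SqlServer", "crm-db", "AuditLog", "hourly"),
    TableSource("Oracle", "erp-db", "Invoices", "nightly"),
]

plan = plan_intake_processes(sources)
for key, tables in plan.items():
    print(key, tables)
```

In this sample, the two SQL Server groups use the same connector and database but different schedules, so they end up as separate Data Intake Processes.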

Steps¶

The configuration steps needed to create a Data Intake Process cover the registration configuration (e.g., Provider), as well as the configuration for deploying the ADF resources (e.g., data source parameters, metadata extraction parameters, trigger).

Step 1. Choose a connector to create the Data Intake Process¶

In Sidra Web UI, go to the page Data Intake > Data Intake Processes.
Click Add New.
You will see a gallery with the connectors available to configure the new Data Intake Process. This list includes the latest versions of all connectors that are compatible with the Sidra version of the corresponding installation.
If the connector has not been installed previously in the Sidra installation, a connector installation process begins in the background. After the installation finishes, you will see a wizard with the configuration steps. Note that the first-time installation of a connector may take a few seconds. Stay on the page to allow the installation to finish.

Step 2. Enter the configuration parameters to create the Data Intake Process¶

Once the wizard with the different configuration steps has loaded, you can start entering the relevant configuration parameters for creating the Data Intake Process and its underlying infrastructure.

The configuration steps are logically grouped in two sets:

- The configuration steps common to all connectors (e.g., Configure Provider, Configure Trigger).
- The configuration steps specific to each connector (e.g., data source or metadata configuration).

Common configuration steps¶

Sidra connectors have some common configuration steps shared across all connectors.

The first step is to configure the name and description of the Data Intake Process object:
- Name: This is the name that will appear in the list of Data Intake Processes in the Sidra Web UI. As such, it can be a friendly name that describes the process, e.g., CRM Database nightly load.
- Description: This is an optional description about the Data Intake Process.
Another common configuration step is the configuration of the Provider. When creating a new Provider, the user can configure some metadata for storing this Provider in Sidra. The name of the Provider is mandatory.

These are the fields to be configured for a Provider in a connector wizard:
- Name: name representing the data origin that needs to be configured. This name can only contain alphanumeric characters.
- Owner: business owner responsible for the creation of the Provider.
- Short description: short description about the Provider.
- Data Storage Unit (DSU): the DSU where the Provider will be set.
- Details: markdown text with additional details on the Provider.
For more details on Sidra Metadata model, please check the documentation.
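
As an illustration of the Provider fields listed above, the following minimal sketch models them as a small Python class and checks the two constraints this page states: the name is mandatory and can only contain alphanumeric characters. The `ProviderConfig` class and its `validate` method are hypothetical and do not represent the Sidra API. Restricting the name to ASCII letters and digits is an assumption, since the page does not say whether other alphabets are accepted.

```python
from dataclasses import dataclass


@dataclass
class ProviderConfig:
    """Hypothetical container for the Provider fields shown in the wizard."""
    name: str
    owner: str = ""
    short_description: str = ""
    dsu: str = ""        # Data Storage Unit where the Provider will be set
    details: str = ""    # markdown text with additional details

    def validate(self):
        """Return a list of error messages; an empty list means the fields look valid."""
        errors = []
        if not self.name:
            errors.append("Name is mandatory.")
        # Assumption: "alphanumeric" means ASCII letters and digits only.
        elif not (self.name.isascii() and self.name.isalnum()):
            errors.append("Name can only contain alphanumeric characters.")
        return errors


print(ProviderConfig(name="CrmDatabase", owner="Sales").validate())
print(ProviderConfig(name="CRM Database").validate())
```

Under these assumptions, a name containing a space, such as `CRM Database`, is rejected, while `CrmDatabase` is accepted.
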
Configure the Data Source. The connection details are added in this step. For the Integration Runtime, the options are:
- Azure Hosted (default)
- Self Hosted. For this option, note that the Integration Runtime should be linked in ADF before creating a DIP.
Configure the Metadata Extractor. This step lets you specify the objects to exclude or the number of tables.
Another infrastructure piece common to all connectors is the Trigger. Connectors usually need to set up a scheduled trigger to execute the data extraction pipeline.

The Sidra connector wizards allow you to create scheduled triggers. When setting up a new scheduled trigger, users need to provide some details, which are explained in the Data Factory documentation:
- Start time: day and hour when the trigger will be executed for the first time.
- End time: optional parameter. If not specified, the trigger will not have an expiration date.
- Frequency: units defined for the interval of executions of the trigger (minute, hour, day, week, month).
- Interval: numeric value on how often to fire the scheduled trigger. The units of measure for this numeric value are given in the Frequency field.
The Trigger configuration step also includes an option to run the data extraction pipeline for the first time just after the Data Intake Process is created. If this setting is set to True, the data extraction pipeline runs automatically as soon as the Data Intake Process artifacts have been created, regardless of the trigger. Subsequent executions of the data extraction pipeline follow the configured trigger settings.
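
The trigger fields above map onto the recurrence settings of a Data Factory schedule trigger. The following minimal sketch builds a definition modelled on the schedule trigger JSON described in the Data Factory documentation. The `build_schedule_trigger` function and the trigger and pipeline names are hypothetical, and the real definition generated by the Sidra wizard may differ.

```python
import json
from datetime import datetime, timezone

# Frequency units listed in the wizard, spelled as in Data Factory trigger definitions.
VALID_FREQUENCIES = {"Minute", "Hour", "Day", "Week", "Month"}
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_schedule_trigger(name, pipeline_name, start_time, frequency, interval, end_time=None):
    """Build a schedule trigger definition as a plain dictionary (illustrative only)."""
    if frequency not in VALID_FREQUENCIES:
        raise ValueError(f"Unsupported frequency: {frequency}")
    if interval < 1:
        raise ValueError("Interval must be a positive integer.")
    if start_time.tzinfo is None:
        raise ValueError("start_time must be timezone-aware.")

    recurrence = {
        "frequency": frequency,
        "interval": interval,
        "startTime": start_time.astimezone(timezone.utc).strftime(TIME_FORMAT),
        "timeZone": "UTC",
    }
    # End time is optional: without it, the trigger has no expiration date.
    if end_time is not None:
        recurrence["endTime"] = end_time.astimezone(timezone.utc).strftime(TIME_FORMAT)

    return {
        "name": name,
        "properties": {
            "type": "ScheduleTrigger",
            "typeProperties": {"recurrence": recurrence},
            "pipelines": [
                {"pipelineReference": {"referenceName": pipeline_name, "type": "PipelineReference"}}
            ],
        },
    }


# Hypothetical example: run the extraction pipeline once a day, starting at 02:00 UTC.
trigger = build_schedule_trigger(
    name="CrmNightlyTrigger",
    pipeline_name="ExtractCrmData",
    start_time=datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc),
    frequency="Day",
    interval=1,
)
print(json.dumps(trigger, indent=2))
```

With Frequency set to `Day` and Interval set to `1`, the trigger fires once a day. Setting Interval to `2` with the same Frequency would fire it every second day.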

Versions

From version 2022.R3 onwards, a trigger can no longer be shared by more than one DIP: a new trigger is created whenever a new DIP is created. The user can then stop or start the trigger of each DIP individually.

Step 3. Validate the configuration and confirm¶

Steps¶

After the mandatory configuration parameters have been entered, you will see a summary screen with the configuration details.
Optionally, you can click the Validate button to validate the configuration parameters. Some validations take place in this step, such as:
- Check that the Provider name does not already exist.
- Check that the data source parameters are correct by establishing a test connection.
If the validation is successful, you will be able to click the Confirmation button.
The execution method of the connector is invoked once the Confirmation action is triggered. This action submits the creation of the Data Intake Process and the related metadata and infrastructure.
Once clicked, a toast message is displayed in Sidra Web, indicating that the creation and configuration process has started. After some minutes, once all the metadata has been registered in the Sidra DB and all the infrastructure (ADF pipelines) has been created, a notification is received.
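
The two validation checks listed above can be sketched as follows. This is an illustration only, not the Sidra implementation: the `validate_configuration` function is hypothetical, the connection test is passed in as a callable so that the sketch does not depend on any database driver, and comparing Provider names case-insensitively is an assumption.

```python
def validate_configuration(provider_name, existing_provider_names, test_connection):
    """Return a list of validation errors; an empty list means validation passed."""
    errors = []

    # Check 1: the Provider name must not already exist.
    # Assumption: names are compared case-insensitively.
    existing = {name.lower() for name in existing_provider_names}
    if provider_name.lower() in existing:
        errors.append(f"Provider '{provider_name}' already exists.")

    # Check 2: the data source parameters must allow a test connection.
    try:
        test_connection()
    except Exception as exc:
        errors.append(f"Test connection failed: {exc}")

    return errors


def failing_connection():
    """Stand-in for a connection attempt with wrong parameters."""
    raise ConnectionError("login failed")


print(validate_configuration("CrmDatabase", ["Erp"], lambda: None))
print(validate_configuration("Erp", ["Erp"], failing_connection))
```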

Background processes executed¶

The execution method of the connector internally packages the following generic actions as background processes:

Create a new Data Intake Process record in Sidra Metadata DB.
Create the new Provider from a template and relate it to the Data Intake Process.
Create the new Data Source from a template.
Create, deploy, and execute the new metadata extraction pipeline to populate the metadata for that data source, whenever it is applicable for the respective connector.
Create and deploy the new data extraction pipeline to physically move the data from source to destination.
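Create or configure the trigger for scheduling the data extraction pipeline. You can choose to create a new trigger or use an existing trigger (from version 2022.R3 onwards, a new trigger is always created, as described in the Versions note above).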
Prepare the DSU ingestion script for data ingestion in Databricks.

Once all of the above metadata and infrastructure has been created, the data ingestion is performed by the data extraction pipeline that the connector execution deploys.
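
The sequence of background actions listed above can be summarised as an ordered plan in which the metadata extraction step is included only when the connector supports it. The sketch below is illustrative: the `build_execution_plan` function is hypothetical, and the step descriptions paraphrase this page rather than naming real Sidra operations.

```python
def build_execution_plan(has_metadata_extraction):
    """Return the background actions in the order described on this page."""
    plan = [
        "Create the Data Intake Process record in the Sidra Metadata DB",
        "Create the Provider from a template and relate it to the Data Intake Process",
        "Create the Data Source from a template",
    ]
    # The metadata extraction pipeline applies only to some connectors.
    if has_metadata_extraction:
        plan.append("Create, deploy, and execute the metadata extraction pipeline")
    plan += [
        "Create and deploy the data extraction pipeline",
        "Create or configure the trigger for the data extraction pipeline",
        "Prepare the DSU ingestion script for Databricks",
    ]
    return plan


for number, action in enumerate(build_execution_plan(has_metadata_extraction=True), start=1):
    print(f"{number}. {action}")
```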

For information about the ADF pipeline flow for this kind of DIP, you can check this section.