Data Intake Process configuration via connector plugin

This page details the configuration steps for a Data Intake Process (DIP) created through connector plugins in Sidra.

Key Aspects

Whether a data intake needs its own Data Intake Process, separate from the others, depends on the underlying data intake configuration. For example:

  • Data intakes from different data source engines (e.g., different database engines like SQL Server and Oracle) need to be created as different Data Intake Processes, because they use different plugins to connect to and configure the data source.

  • Data intakes from the same data source engine (e.g., DB2) but different databases need to be created as different Data Intake Processes: they use the same plugin, but different parameters to connect to the data source, such as the connection string.

  • Data intakes from the same data source system can also be defined as separate Data Intake Processes. This is useful in the following example scenarios:

    • You need separate Data Intake Processes to load distinct sets of tables from the same database, with the extraction scheduled via the same trigger in Azure Data Factory (ADF).

    • You need separate Data Intake Processes to load distinct sets of tables from the same database, with a different trigger or schedule for each set of tables. In this case, one Data Intake Process would use one trigger, and the other Data Intake Process would use a different trigger.

Steps

Creating a Data Intake Process involves two kinds of configuration: the registration configuration (e.g., Provider) and the ADF resources deployment configuration (e.g., data source parameters, metadata extraction parameters, trigger, etc.).

Step 1. Choose a plugin to create the Data Intake Process

  ≥ 2022.R1 
  1. In Sidra Web UI, go to the page Data Intake > Data Intake Processes.
  2. Click on Add New.
  3. You will see a gallery with the plugins available to configure the new Data Intake Process. This list includes the latest versions of all plugins that are compatible with the Sidra version in the corresponding installation.
  4. If the plugin has not been installed previously in the Sidra installation, a plugin installation process begins in the background. After the installation finishes, you will see a wizard with the configuration steps. Please note that the first-time installation of a plugin may take a few seconds. Stay on the page to allow the installation to finish.

Step 2. Enter the configuration parameters to create the Data Intake Process

Once the wizard with the different configuration steps is loaded, you can start entering the configuration parameters for creating the Data Intake Process and its underlying infrastructure.

The configuration steps are logically grouped in two sets:

  • The configuration steps common to all plugins (e.g., Configure Provider, Configure Trigger).
  • The configuration steps specific to each plugin (e.g., data source or metadata configuration).

Common configuration steps

All Sidra plugins of type connector share some common configuration steps.

  ≥ 2022.R1 
  1. The first step is to configure the name and description of the Data Intake Process object:

    • Name: This is the name that will appear in the list of Data Intake Processes in the Sidra Web UI. As such, it can be a friendly, self-descriptive name, e.g., CRM Database nightly load.
    • Description: This is an optional description of the Data Intake Process.

  2. Another common configuration step is the configuration of the Provider. When creating a new Provider, the user can configure some metadata for storing this Provider in Sidra. The name of the Provider is mandatory.

    These are the fields to be configured for a Provider on a connector plugin wizard:

    • Name: name representing the data origin that needs to be configured. This name can only contain alphanumeric characters.
    • Owner: business owner responsible for the creation of the Provider.
    • Short description: short description about the Provider.
    • Data Storage Unit (DSU): the DSU where the Provider will be set.
    • Details: markdown text with additional details on the Provider.

    For more details on Sidra Metadata model, please check the documentation.

  3. Another infrastructure piece common to all plugins of type connector is the Trigger. Connector plugins usually need a scheduled trigger to execute the data extraction pipeline.

    The Sidra connector wizard allows you to create scheduled triggers. When setting up a new scheduled trigger, you need to provide the following details, which are explained in the Data Factory documentation:

    • Start time: day and hour when the trigger will be executed for the first time.
    • End time: optional parameter. If not specified, the trigger does not expire.
    • Frequency: units defined for the interval of executions of the trigger (minute, hour, day, week, month).
    • Interval: numeric value on how often to fire the scheduled trigger. The units of measure for this numeric value are given in the Frequency field.

    The Trigger configuration step also includes an option to run the data extraction pipeline for the first time just after the Data Intake Process is created. If this setting is set to True, the data extraction pipeline runs automatically once the Data Intake Process artifacts have been created, regardless of the trigger. Subsequent executions of the data extraction pipeline follow the configured trigger settings.
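
As an illustration of how the four trigger fields relate, here is a minimal Python sketch. It assumes the fields map onto the recurrence section of an ADF schedule trigger (`frequency`, `interval`, `startTime`, `endTime`). The `ScheduleSettings` class and its methods are hypothetical helpers, not part of Sidra, and the preview treats the end time as inclusive, which is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Frequencies with a fixed length. "Month" is deliberately left out of the
# preview because calendar months do not have a fixed duration.
_FIXED_STEP = {
    "Minute": timedelta(minutes=1),
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
}


@dataclass
class ScheduleSettings:
    """Hypothetical container for the wizard's trigger fields (not a Sidra type)."""
    start_time: datetime
    frequency: str                       # unit of the interval, e.g. "Hour"
    interval: int                        # how many units between two executions
    end_time: Optional[datetime] = None  # None means the trigger never expires

    def to_recurrence(self) -> dict:
        """Shape the fields like the recurrence section of a schedule trigger."""
        recurrence = {
            "frequency": self.frequency,
            "interval": self.interval,
            "startTime": self.start_time.isoformat(),
        }
        if self.end_time is not None:
            recurrence["endTime"] = self.end_time.isoformat()
        return recurrence

    def preview(self, count: int) -> List[datetime]:
        """Return up to `count` firing times, starting at the start time."""
        if self.frequency not in _FIXED_STEP:
            raise ValueError(f"preview not supported for {self.frequency!r}")
        step = _FIXED_STEP[self.frequency] * self.interval
        times: List[datetime] = []
        current = self.start_time
        while len(times) < count:
            # Assumption: an execution falling exactly on the end time still runs.
            if self.end_time is not None and current > self.end_time:
                break
            times.append(current)
            current += step
        return times
```

For example, `ScheduleSettings(datetime(2022, 11, 17, 2, 0), "Hour", 6).preview(3)` yields executions at 02:00, 08:00 and 14:00 on the same day: Frequency sets the unit and Interval sets how many of those units separate two executions.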

Versions

For versions below 2022.R2, users can choose to create a new trigger in Sidra to schedule the Data Intake Process, or re-use an existing trigger. From version 2022.R3 onwards, this option has been removed to avoid having more than one DIP associated with the same trigger: a new trigger is created whenever a new DIP is created. In this case, the user can stop or start the trigger for each DIP.

A note on Integration Runtime

To connect to some origin servers, an Integration Runtime machine must be installed. This is the case for many source systems that are on-premises or inside a VNET.

The Integration Runtime can be installed on an on-premises machine or on an Azure VM, as long as it can serve as a link between the DSU and the origin data server.

It is recommended to perform the installation with a script that the Sidra team can provide. This script automates the installation of the JRE and the Visual C++ Redistributable, both of which are needed to perform the data ingestion into the Data Lake.

Additionally, after installing the above-mentioned modules, you may need to add JAVA_HOME to the environment variables.

To verify the installation, run the following command:

java -version
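
The same check can be scripted. The following is a minimal Python sketch, assuming you want to confirm both that JAVA_HOME points to an existing directory and that a `java` executable can be found on the PATH. The `check_java` function is a hypothetical helper, not something shipped with Sidra or the Integration Runtime.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional


def check_java(env: Optional[Mapping[str, str]] = None) -> dict:
    """Report whether JAVA_HOME is set and whether `java -version` can run.

    Hypothetical helper: `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    report = {
        "java_home": java_home,
        "java_home_is_dir": bool(java_home) and os.path.isdir(java_home),
        "java_on_path": None,
        "version_output": None,
    }
    # Look for the java executable in the PATH of the given environment.
    java_exe = shutil.which("java", path=env.get("PATH"))
    report["java_on_path"] = java_exe
    if java_exe:
        result = subprocess.run(
            [java_exe, "-version"], capture_output=True, text=True
        )
        # `java -version` writes its banner to stderr, not stdout.
        report["version_output"] = (result.stderr or result.stdout).strip()
    return report


if __name__ == "__main__":
    for key, value in check_java().items():
        print(f"{key}: {value}")
```

If `java_home_is_dir` is False or `java_on_path` is empty on the Integration Runtime machine, the JRE installation or the environment variables most likely need to be reviewed.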

Step 3. Validate the configuration and confirm

Steps

  1. After the mandatory configuration parameters have been entered, you will see a summary screen with the configuration details.

  2. You can also optionally click on the button Validate to validate the configuration parameters. Some validations take place in this step, such as:

    • Check that the Provider name does not already exist.
    • Check that the data source parameters are correct by establishing a test connection.

  3. If the validation is successful, you will be able to click the Confirmation button.

  4. The execution method of the plugin of type connector is invoked once the Confirmation action is triggered. This action submits the creation of the Data Intake Process and related metadata and infrastructure.

  5. Once the button is clicked, a toast message is displayed in Sidra Web, indicating that the creation and configuration process has started. After some minutes, once all the metadata has been registered in the Sidra DB and all the infrastructure (ADF pipelines) has been created, a notification is received.
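
The Provider name checks performed during validation can be approximated before opening the wizard. This is a minimal Python sketch, not Sidra's actual validation logic: `validate_provider_name` is a hypothetical helper, "alphanumeric" is assumed to mean ASCII letters and digits, and the uniqueness comparison is assumed to be case-insensitive.

```python
import re
from typing import Iterable, List

# Assumption: "alphanumeric" means ASCII letters and digits only.
_ALPHANUMERIC = re.compile(r"[A-Za-z0-9]+")


def validate_provider_name(name: str, existing_names: Iterable[str]) -> List[str]:
    """Return a list of problems found; an empty list means the name looks valid."""
    errors: List[str] = []
    if not name:
        errors.append("The Provider name is mandatory.")
        return errors
    if not _ALPHANUMERIC.fullmatch(name):
        errors.append("The Provider name can only contain alphanumeric characters.")
    # Assumption: two names differing only in letter case count as duplicates.
    taken = {existing.lower() for existing in existing_names}
    if name.lower() in taken:
        errors.append("A Provider with this name already exists.")
    return errors
```

For instance, `validate_provider_name("CRM Database", [])` reports the space as an invalid character, while `validate_provider_name("CRMDatabase", ["Sales"])` returns an empty list.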

Background processes executed

This execution method of the plugin internally packages the following generic actions as background processes:

  • Create a new Data Intake Process record in Sidra Metadata DB.
  • Create the new Provider from a template and relate it to the Data Intake Process.
  • Create the new Data Source from a template.
  • Create, deploy, and execute the new metadata extraction pipeline to populate the metadata for that data source, whenever it is applicable for the respective plugin.
  • Create and deploy the new data extraction pipeline to physically move the data from source to destination.
  • Create or configure the trigger for scheduling the data extraction pipeline. Depending on the Sidra version, you can create a new trigger or re-use an existing one.
  • Prepare the DSU ingestion script for data ingestion in Databricks.

Once all of the above metadata and infrastructure has been created, the data ingestion is performed by the data extraction pipeline that the plugin execution deploys.

For information about the ADF flow of pipelines, you can check this section for this kind of DIP.


Last update: 2022-11-17