Skip to content

How to configure a new data intake in Sidra

This section describes the general process of setting up a new Data Intake Process for end-to-end extract and load pipelines. These are the types of Data Intake Processes to be configured for sources like databases. Please visit Sidra Data Intake Process overview for an overview and basic concepts of the Data Intake Process in Sidra.

This page also includes references to common concepts and pointers to specific methods for performing some of the steps for the described type of data intake.


Asset flow into Sidra's platform via landing zone

For a detailed explanation of a complete Data Intake Process using the landing zone, please continue in this page .

Step 1. Configure and create the Data Source

One of the key elements of ADF that need to be created and deployed in ADF besides pipelines is the Linked Service.

Linked Services in Sidra are abstracted under the term Data Source. A Data Source in Sidra represents the connection to the source data system. A Data Source abstracts the details of creating a Linked Service in Azure Data Factory. Then Linked Services behave as connection strings, allowing Data Factory to connect to different data sources like SQL servers, REST endpoints, Blob folders, etc. through several Data Source templates provided by Sidra for common source systems as, e.g., "AzureSql", "HttpServer", etc.

Sidra provides REST endpoints to register a new Data Source from a template, as well as to automatically deploy such infrastructure in Azure Data Factory for the pipelines to use. This is the recommended method for registering a Data Source in Sidra and therefore deploying a Linked Service in ADF. For some types of data sources, like SQL database, there is even a more automated and friendly way, that is using Connectors from the Web.

For configuring a new Data Source:

  1. The fields required are the Integration Runtime and the connection string in the case of a database.
  2. Sidra requires to register a new Data Source registry in Sidra metadata DB.
  3. Sidra metadata DB will trigger the deployment of a Linked Service in Azure Data Factory with this connection.
  4. The created Linked Service will use the Key Vault in Sidra Core to store the connection string to the database.


For more details on Linked Services check the Data Factory documentation.


Connecting to new data sources through linked services

For more details on the manual process of what constitutes the creation of a new linked service from a template, you can visit the page Connecting to new data sources through linked services .

Step 2. Define and configure the Asset metadata for the Data Intake Process


Section Sidra metadata model hierarchy contains an explanation of the key pieces that conform the Sidra metadata model, namely DSU, Providers, Entities and Attributes.

Any new data source to be configured in Sidra requires this metadata to be populated as a first to represent the actual data to be ingested. Sidra metadata is stored in the DataIngestion schema inside the Sidra Core database. The metadata information contains information about the Entities to be ingested, as well as how these Entities should be handled by the system.

Sidra adds additional system-level Attributes to convey system-handling information at Entity (or table) level.

These are the different steps required in order to set up the metadata for a new data source in Sidra. Links to tutorial pages are included for each of the steps for a generic data source:

  1. Create a new Provider (if required). A new data source will be composed of one or several Entities. Such Entities can be part of an existing Provider (because they are logically related to that Provider), or we can choose to associate these Entities to an existing Provider. More details on how to set up a Provider in Sidra are here. For example, a whole database could be considered a Provider or part of a Provider.
  2. Create Entity/Entities. A new data source is composed of at least one new Entity. For example, each table in an SQL database will be considered each a different Entity. More details on how to configure an Entity in Sidra are here.
  3. Create new Attributes. For example, to specify each of the fields of a database table or a CSV file. More details on how to configure an Attribute in Sidra are here.
  4. Create an AttributeFormat (if needed): More details on how to configure an AttributeFormat in Sidra are here.

Methods to configure the metadata for a Data Intake Process

Depending on the type of data source, Sidra incorporates helper methods and accelerators to do the above steps, and configure the actual file ingestion into the DSU.

Most generally, these are the ways to populate the metadata in Sidra Core database:

Metadata population in Sidra Core database

  • See details in above tutorial links.
  • See details in above tutorial links.
  • This is a recommended option over the SQL scripts method, if the data source does not include yet the support for a Sidra Connector (see next point).
  • For most of databases (e.g., SQL database), Sidra incorporates accelerators like this.
  • Metadata extractor pipelines create a pipeline associated with the Provider in order to automatically retrieve and infer all the metadata from the data source.
  • Examples of the pipeline templates generating these type of pipelines are: Insert metadata using database schema for TSQL and Insert metadata using database schema for DB2.
  • These type of pipelines can be deployed and executed through API calls.


How to configure tables to exclude

Tutorial How to configure tables to exclude describes how to configure tables to be excluded from the SQL data extraction when using a SQL metadata extraction pipeline.

Metadata Extraction Pipelines

This accelerators available for some sources, like SQL databases, will be used for two main purposes internally in Sidra:

  • Create and populate the Sidra Core metadata structures (Entities and Attributes), from the information of the different tables and columns in the source database system. This would encompass the above actions for creating Entities and Attributes.
  • Prepare for the actual file ingestion into the Data Storage Unit by auto-generating the create tables and transfer query scripts for each Entity, including all the steps for storing the optimized data in the data lake storage. This is described in Step 2 in this document.

Once the metadata extractor pipeline has been created, it needs to be executed. This execution could take a while (from 5 to 15 minutes) depending on the load of Data Factory and the status and load of the associated Databricks cluster. Sidra exposes API endpoints as well to manually execute the SQL Metadata extractor pipeline.

If metadata extraction pipelines are not available for the data source, then the process for creating the Asset metadata extraction will need to be manually triggered, and executed as described in the above links (Create Entities, Create Attributes, etc.).

Step 3. Define and create the data extraction pipeline to extract data from the source system

Once all the metadata regarding a set of Assets from a data source has been configured (Provider, Entities, Attributes) in step 2, the next step is to populate the metadata database with the information of the Azure Data Factory pipelines.

The movement of data in Sidra platform is orchestrated by Azure Data Factory (ADF). Every object in ADF is defined using a JSON structure that can be used along with the API and SDKs provided by Azure Data Factory (ADF) to deploy pipelines programmatically.

Sidra Core stores the information that defines a pipeline in the Core Database, by using a series of pipeline templates. Those ADF component templates are the JSON structures that define the component, but containing placeholders. When the placeholders are resolved with actual values, the resulting JSON structure can be used to create the Data Factory components in ADF.

Sidra data platform takes advantage of this capability and provides a metadata system for storing those JSON structures and a component to automatically deploy those pipelines in ADF:

  • The metadata system is composed by a set of tables in the metadata database that are generically called Data Factory metadata, located in the DataIngestion schema of the metadata database.*
  • Data Factory Manager is the internal Sidra component (the user does not need to configure anything for this), that uses the Data Factory metadata and programmatically deploys the ADF objects defined in it.

This system automates the deployment of pipelines so they can be easily setup in any environment -for example to deploy to a new test environment- at any time given. The idea is not only to reproduce the environment configuration in Azure Data Factory, but also being able to add new pipelines by adding information to the metadata.

The Data Factory metadata is organized using templates for each of the ADF components -activities, datasets, triggers, pipelines. This facilitates the creation of new pipelines by reusing the ones already created.



Configure new pipelines

Section Configure new pipelines describes the step-by-step process to configure a new pipeline based on the templates provided by Sidra.

Configure the data extraction and ingestion pipeline

Sidra incorporates different out of the box pipelines and pipeline templates to orchestrate data extraction from source systems.

Data extraction pipeline templates create pipelines associated with the Provider in order to orchestrate the data extraction from the source.

The data extraction and ingestion pipelines to use vary depending on the type of data source.

For some types of data sources, like databases, Sidra comes with data extractor pipeline templates, like the BatchExtractAndIntakeSQL. Pipelines created out of this template perform transparently all actions from the actual extraction or movement of data from the source system, to the execution of the actual file ingestion into the Data Storage Unit. Steps 4 and 5 in this document describe how to deploy and execute these pipelines.

The specific pipeline to deploy and run is selected based on the configuration of the Entity to which the Asset is associated. There is an Entity-Pipeline association that needs to happen in order to associate an Entity with a data extraction pipeline.

Step 4. Prepare or create scripts to create the tables in Databricks that will store the final data

This step involves the code auto-generation (not its execution) required in order to eventually execute the ingestion of the data into the Data Storage Unit (Databricks).

The page Data Ingestion describes the general steps involved in a very typical data ingestion in Sidra: file batch ingestion from the landing zone. Within this flow, it is explained that the part of actual data ingestion into the DSU is comprised of two stages: file registration and file ingestion in Databricks.

In the above page, it is explained that, after performing the actual data extraction from the data source into Sidra Landing page or similar, the actual file ingestion process is performed to actually ingest the data in an optimized way in the Data Storage Unit. The file ingestion is the process that reads the raw copy of the file and persists the information in an optimized format in the DSU, after executing some initial optimizations. Data ingestion in Databricks describes the process in more detail.

The specific File ingestion part has as a pre-requirement the generation of some scripts. Two different Spark scripts need to be created sequentially: Table creation and Transfer query scripts.

These scripts are created specifically for the Entity of the Asset to be ingested. This means that Assets from the same Entity will share the same scripts.

  1. The Table creation script creates the necessary database and tables in the Databricks cluster. More detailed information is included in Table creation.
  2. The Transfer query script reads the raw copy of the Asset and inserts the information in the tables created by the previous script in an optimized format. One transfer query script is generated per Entity. More information about this script can be found in Transfer query section.

In this step about preparing the file ingestion structures, the create tables and transfer query scripts are created with the parametrization of the Entity metadata details. Therefore, there will be one transfer query script generated per Entity.


For a detailed explanation on the file ingestion step, you can check this other page.

Depending on the type of data source, we can distinguish where this step 4 (prepare the file ingestion structures) is actually being implemented:

  • The metadata extraction pipelines available for some database sources (SQL) implement transparently this step of preparing the transfer query scripts for data ingestion in Databricks. Afterwards, it is the actual data extraction pipeline which executes the transfer query scripts.

  • For the typical flow of file ingestion from the landing zone, the pipeline FileIngestionDatabricks contains a step to generate the transfer query scripts.

In a later step (step 5), the actual execution of the transfer query scripts will happen from a Databricks ADF activity.

Step 5. Deploy the data extraction and data ingestion pipeline

Once the different pipelines for data extraction have been created (for example, any of the ExtractFrom{data source name} these pipelines need to be deployed into Azure Data Factory.

Data extraction pipelines have a pre-requirement of having all the metadata about the sources configured. In case there is a metadata extractor pipeline available, this pipeline needs to be deployed and executed first before proceeding with the deployment and execution of the data extractor pipeline (see Step 1 in this document). Sidra API incorporates different methods for deploying ADF pipelines.

As described before, data extraction pipelines differ depending on the type of the data source.

Deploying an ADF pipeline can take several minutes depending on the status of the environment. Processing cluster resources (e.g. Databricks) are required in order to perform this operation in Sidra Core.

Configure the trigger to schedule the data extraction

  • Sidra incorporates different templates for ADF triggers in order to schedule the execution of the data extractor pipelines.

  • Triggers are basically representations of the specific schedule for running an ADF pipeline (example, daily executions at a fixed time).

  • Sidra provides TriggerTemplates in order to instantiate a new trigger. Sidra Core API provides an endpoint to create a new trigger from the Id of the template.

  • The purpose of the trigger is to execute an associated pipeline at the desired time. Existing triggers can also be used instead of creating a new trigger.

  • In any case a request needs to be made to Sidra API to associate the data extractor pipeline with the newly created or existing trigger.

Step 6. Execute the data extraction and ingestion pipelines

Once the data extraction and ingestion pipelines have been deployed, they need to be executed or run in order to perform the actual data extraction.

As described before, data extraction pipelines differ depending on the type of the data source.

When the actual ingestion process happens the following actions take place:

  1. The file registration. This is the process of creating an Asset in the platform representing the file to be ingested in the DSU.
    • The result of the registration is the population of the correct data (registration of the Asset) in the Sidra Core intake metadata and control tables. The files are registered using the Sidra API.

  2. The file ingestion. This is the process that reads the raw copy of the file and intakes the information in the DSU, after executing some initial optimizations (e.g. error validation, partitioning, etc.).
    • The file ingestion sub-step is performed by an Azure Data Factory pipeline that will be selected depending on the configuration of the Entity associated to the Asset.
    • The file ingestion involves the execution of the transfer query scripts prepared for each Entity in step 2.

Depending on the type of data source, the extraction and ingestion pipelines differ in the number of steps included. Data extraction pipelines for database sources perform all the steps required to ingest the data into the Data Storage Unit, from the actual movement of data from the source, to the Asset registration and the actual execution of the transfer query scripts that perform the data optimization actions before storing the data in the Data Storage Unit (file ingestion). Therefore, in this case there is not a distinction between data extraction and data ingestion pipelines.

Sidra Ideas Portal

Last update: 2022-07-14
Back to top