How to configure a new data intake in Sidra

This section describes the general process of setting up a new data intake process for end-to-end extract and load pipelines. This is the type of data intake process typically configured for sources like databases. Please visit the Sidra data intake process overview for an overview and the basic concepts of the data intake process in Sidra.

This page also includes references to common concepts and pointers to specific methods for performing some of the steps for the described type of data intake.

For the specific details on data ingestion from the landing zone, please visit this page.

Basic steps and concepts

The generic steps for configuring a data intake process are the following:

  • Step 1: Configure and create the Data Source. This creates an underlying ADF Linked Service used in ADF to actually connect to the source system.
  • Step 2: Define and configure the Asset metadata for the data intake process (e.g., Provider/Entity/Attribute).
  • Step 3: Define and create the data extraction pipeline to actually extract data from the source system at defined scheduled intervals.
  • Step 4: Prepare or create scripts to create the tables in Databricks that will store the final data (Table creation and Transfer Query scripts).
  • Step 5: Deploy the data extraction pipeline and associate a trigger.

Depending on the type of data source, some of these steps may be simplified or executed together. Also depending on the type of data source, Sidra incorporates additional components that abstract some of the details of these steps and wrap them into a set of self-service UI wizard steps. This is thanks to the plugin approach described in the Sidra connectors section. See the section on the Connectors wizard for more details on what Sidra connectors are and how they can be used from Sidra Web to set up new data intake processes.

After these configuration steps have been completed, the new data intake process is ready to run.

The following steps involve the actual periodic execution of the configured extract and load process:

  • Step 6: Execution of the data intake (Extract and Load) process
  • Sub-step 6.1: Extract data from source: the data is extracted directly from the data source (e.g. SQL Database), to the container that hosts the raw format files in the Data Storage Unit. In this case the files do not go through the landing zone.
  • Sub-step 6.2: File (Asset) registration.
  • Sub-step 6.3: File ingestion.

Below is a high-level description of what these steps mean and why they are required, with notes on the specifics for different types of data sources where relevant.

Step 1: Configure and create the Data Source

Besides pipelines, one of the key elements that needs to be created and deployed in ADF is the Linked Service.

Linked Services in Sidra are abstracted under the term Data Source. A Data Source in Sidra represents the connection to the source data system.

A Data Source abstracts the details of creating a Linked Service in Azure Data Factory. Data Sources behave like connection strings, allowing Data Factory to connect to different source systems: SQL servers, REST endpoints, Blob folders, etc.

Usually, the fields required for setting up a Data Source are the Integration Runtime and, in the case of a database, the connection string.

To configure a new Data Source, Sidra requires registering a new Data Source record in the Sidra metadata database, which in turn triggers the deployment of a Linked Service in Azure Data Factory with this connection. The created Linked Service will use the Key Vault in Sidra Core to store the connection string to the database.

Sidra provides several Data Source templates for common data sources, e.g. "AzureSql", "HttpServer", etc.

For more details on Linked Services check the Data Factory documentation.

Sidra provides REST endpoints to register a new Data Source from a template, as well as to automatically deploy the corresponding infrastructure in Azure Data Factory for the pipelines to use. This is the recommended method for registering a Data Source in Sidra and thereby deploying a Linked Service in ADF. For some types of data sources, like SQL databases, there is an even more automated and user-friendly way: using Connectors from Sidra Web. For more details on the manual process of creating a new Linked Service from a template, you can visit the page Connecting to new data sources through linked services.
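As an illustration of the API-based route, the minimal sketch below registers a Data Source from a template. The endpoint route, template identifier and payload fields are assumptions made for illustration only, not the literal Sidra API contract; please check the Sidra API (Swagger) documentation of your installation for the actual routes and schemas.

```python
# Minimal sketch: register a Data Source through the Sidra API.
# NOTE: the endpoint route and payload fields below are hypothetical
# placeholders, not the authoritative Sidra API contract.
import requests

SIDRA_API = "https://<your-sidra-core>/api"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token obtained from your identity provider>"}

payload = {
    "name": "MyAzureSqlSource",               # friendly name for the Data Source
    "dataSourceTemplate": "AzureSql",         # one of the templates mentioned above
    "integrationRuntime": "AutoResolveIntegrationRuntime",
    # The connection string is stored by the platform in the Sidra Core Key Vault;
    # here it is only passed as the template expects it.
    "connectionString": "Server=tcp:myserver.database.windows.net;Database=mydb;...",
}

response = requests.post(
    f"{SIDRA_API}/datasources",               # hypothetical endpoint
    json=payload,
    headers=HEADERS,
    timeout=30,
)
response.raise_for_status()
print("Data Source registered:", response.json())
```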

Step 2: Define and configure the Asset metadata for the data intake process

The section Sidra metadata model hierarchy contains an explanation of the key pieces that make up the Sidra metadata model, namely DSUs, Providers, Entities and Attributes.

Any new data source to be configured in Sidra requires this metadata to be populated as a first step, to represent the actual data to be ingested. Sidra metadata is stored in the DataIngestion schema inside the Sidra Core database. The metadata contains information about the Entities to be ingested, as well as how these Entities should be handled by the system.

Sidra adds additional system-level Attributes to convey system-handling information at Entity (or table) level.

These are the different steps required in order to set up the metadata for a new data source in Sidra. Links to tutorial pages are included for each of the steps for a generic data source:

  • Create a new Provider (if required). A new data source will be composed of one or several Entities. Such Entities can be part of an existing Provider (because they are logically related to that Provider), or we can choose to create a new Provider to group them. More details on how to set up a Provider in Sidra are here. For example, a whole database could be considered a Provider or part of a Provider.
  • Create Entity/Entities. A new data source is composed of at least one new Entity. For example, each table in a SQL database will be considered a different Entity. More details on how to configure an Entity in Sidra are here.
  • Create new Attributes. For example, to specify each of the fields of a database table or a CSV file. More details on how to configure an Attribute in Sidra are here.
  • Create an AttributeFormat (if needed): More details on how to configure an AttributeFormat in Sidra are here.

Methods to configure the metadata for a data intake process

Depending on the type of data source, Sidra incorporates helper methods and accelerators to do the above steps, and configure the actual file ingestion into the DSU.

Most generally, there are four main ways to populate the metadata in the Sidra Core database:

  • By means of SQL scripts (see details in above tutorial links).
  • By means of API calls (see details in the above tutorial links). This is the recommended option over the SQL scripts method if the data source does not yet support a Sidra Connector (see next point). A minimal sketch of this approach is shown after this list.
  • By means of a Sidra Connector, configured through the connectors wizard in Sidra Web, for the data source types that support it (see the Connectors wizard section referenced above).
  • For most databases (e.g., SQL databases), Sidra incorporates accelerators such as metadata extraction pipelines: metadata extractor pipeline templates create a pipeline associated with the Provider in order to automatically retrieve and infer all the metadata from the data source. Examples of the pipeline templates generating these types of pipelines are: Insert metadata using database schema for TSQL and Insert metadata using database schema for DB2. These types of pipelines can be deployed and executed through API calls. The tutorial How to configure tables to exclude describes how to configure tables to be excluded from the SQL data extraction when using a SQL metadata extraction pipeline.
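The sketch below illustrates the API-based option by creating a Provider, an Entity and its Attributes. The endpoint routes and payload fields are hypothetical placeholders introduced for this example; the tutorial links above and the Sidra API (Swagger) documentation describe the actual request schemas.

```python
# Minimal sketch: populate Sidra metadata through API calls.
# NOTE: endpoint routes and payload fields are hypothetical placeholders.
import requests

SIDRA_API = "https://<your-sidra-core>/api"   # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}

# 1) Create (or reuse) a Provider that will group the Entities of this source.
provider = requests.post(
    f"{SIDRA_API}/metadata/providers",
    json={"providerName": "SalesDb", "description": "Sales SQL database"},
    headers=HEADERS, timeout=30,
).json()

# 2) Create one Entity per table (or file type) to ingest.
entity = requests.post(
    f"{SIDRA_API}/metadata/entities",
    json={
        "idProvider": provider["id"],
        "name": "Customers",
        "tableName": "Customers",
        "regularExpression": "Customers_\\d{8}\\.csv",
    },
    headers=HEADERS, timeout=30,
).json()

# 3) Create the Attributes (columns) of the Entity.
for name, sql_type, order in [("CustomerId", "INT", 1), ("Name", "NVARCHAR(100)", 2)]:
    requests.post(
        f"{SIDRA_API}/metadata/attributes",
        json={"idEntity": entity["id"], "name": name, "sqlType": sql_type, "order": order},
        headers=HEADERS, timeout=30,
    ).raise_for_status()
```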

The above-mentioned metadata extraction pipelines available for some sources, like SQL databases, will be used for two main purposes internally in Sidra:

  • Create and populate the Sidra Core metadata structures (Entities and Attributes) from the information about the different tables and columns in the source database system. This encompasses the actions described above in Step 2 for creating Entities and Attributes.
  • Prepare for the actual file ingestion into the Data Storage Unit by auto-generating the table creation and transfer query scripts for each Entity, including all the steps for storing the optimized data in the data lake storage. This is described in Step 4 in this document.

Once the metadata extractor pipeline has been created, it needs to be executed. This execution could take a while (from 5 to 15 minutes) depending on the load of Data Factory and the status and load of the associated Databricks cluster. Sidra also exposes API endpoints to manually execute the SQL metadata extractor pipeline.
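For example, a run of the metadata extractor pipeline could be launched and monitored via the API roughly as sketched below. The routes, parameter names and status values are assumptions for illustration only.

```python
# Minimal sketch: launch the metadata extractor pipeline and wait for it to finish.
# Routes, parameter names and status values are hypothetical placeholders.
import time
import requests

SIDRA_API = "https://<your-sidra-core>/api"
HEADERS = {"Authorization": "Bearer <token>"}

run = requests.post(
    f"{SIDRA_API}/datafactory/pipelines/<pipelineId>/run",       # hypothetical route
    headers=HEADERS, timeout=30,
).json()

# Metadata extraction can take 5 to 15 minutes, so poll the run status.
while True:
    status = requests.get(
        f"{SIDRA_API}/datafactory/pipelines/runs/{run['runId']}",  # hypothetical route
        headers=HEADERS, timeout=30,
    ).json()["status"]
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(60)

print("Metadata extractor pipeline finished with status:", status)
```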

If metadata extraction pipelines are not available for the data source, then the Asset metadata will need to be created manually, as described in the above links (create Entities, create Attributes, etc.).

Step 3: Define and create the data extraction pipeline to extract data from the source system

Once all the metadata regarding a set of Assets from a data source has been configured (Provider, Entities, Attributes) in Step 2, the next step is to populate the metadata database with the information about the Azure Data Factory pipelines.

The movement of data in Sidra platform is orchestrated by Azure Data Factory (ADF). Every object in ADF is defined using a JSON structure that can be used along with the API and SDKs provided by Azure Data Factory (ADF) to deploy pipelines programmatically.

Sidra Core stores the information that defines a pipeline in the Core database, using a series of pipeline templates. These ADF component templates are the JSON structures that define each component, but contain placeholders. When the placeholders are resolved with actual values, the resulting JSON structure can be used to create the Data Factory components in ADF.
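To make the template-plus-placeholders idea concrete, the sketch below resolves placeholders in a simplified, invented pipeline template. The template content, placeholder syntax and field names are illustrative only and do not reflect the exact format stored in the Data Factory metadata tables.

```python
# Minimal sketch of the template-plus-placeholders idea.
# The template content and placeholder syntax are invented for illustration;
# the real templates live in the Data Factory metadata tables in Sidra Core.
import json
from string import Template

pipeline_template = Template(json.dumps({
    "name": "ExtractFrom${providerName}",
    "properties": {
        "activities": [
            {
                "name": "CopyFromSource",
                "type": "Copy",
                "inputs": [{"referenceName": "${sourceDatasetName}", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "${rawStorageDatasetName}", "type": "DatasetReference"}],
            }
        ]
    },
}))

# Resolving the placeholders yields a JSON document that could be submitted
# to the ADF REST API / SDK to create the pipeline.
resolved = pipeline_template.substitute(
    providerName="SalesDb",
    sourceDatasetName="SalesDbSourceDataset",
    rawStorageDatasetName="RawStorageDataset",
)
print(json.loads(resolved)["name"])  # ExtractFromSalesDb
```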

Sidra data platform takes advantage of this capability and provides a metadata system for storing those JSON structures and a component to automatically deploy those pipelines in ADF:

  • The metadata system is composed of a set of tables, generically called Data Factory metadata, located in the DataIngestion schema of the metadata database. Section Data Factory tables describes these tables in more detail.
  • Data Factory Manager is the internal Sidra component (the user does not need to configure anything for this) that uses the Data Factory metadata and programmatically deploys the ADF objects defined in it.

This system automates the deployment of pipelines so they can be easily set up in any environment (for example, a new test environment) at any given time. The idea is not only to reproduce the environment configuration in Azure Data Factory, but also to be able to add new pipelines simply by adding information to the metadata.

The Data Factory metadata is organized using templates for each of the ADF components (activities, datasets, triggers, pipelines). This facilitates the creation of new pipelines by reusing the templates already created.

Section Configure new pipelines describes the step-by-step process to configure a new pipeline based on the templates provided by Sidra.

Configure the data extraction and ingestion pipeline

Sidra incorporates different out of the box pipelines and pipeline templates to orchestrate data extraction from source systems.

Data extraction pipeline templates create pipelines associated with the Provider in order to orchestrate the data extraction from the source.

The data extraction and ingestion pipelines to use vary depending on the type of data source.

For some types of data sources, like databases, Sidra comes with data extractor pipeline templates, like the BatchExtractAndIntakeSQL. Pipelines created from this template transparently perform all actions, from the actual extraction or movement of data from the source system to the execution of the actual file ingestion into the Data Storage Unit. Steps 5 and 6 in this document describe how to deploy and execute these pipelines.

The specific pipeline to deploy and run is selected based on the configuration of the Entity to which the Asset is associated. An Entity-Pipeline association needs to be created in order to associate an Entity with a data extraction pipeline. In the case of the pipeline template BatchExtractAndIntakeSQL there is an additional simplification: all Entities from a Provider are automatically associated with the pipeline if a Provider is specified.
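As an illustration, creating a data extraction pipeline from the BatchExtractAndIntakeSQL template and letting Sidra associate all the Provider's Entities with it could look roughly like the sketch below. The route, identifier format and parameter names are assumptions made for this example.

```python
# Minimal sketch: instantiate a data extraction pipeline from a template.
# Route, identifiers and parameter names are hypothetical placeholders.
import requests

SIDRA_API = "https://<your-sidra-core>/api"
HEADERS = {"Authorization": "Bearer <token>"}

payload = {
    "pipelineTemplate": "BatchExtractAndIntakeSQL",   # template referenced above
    "name": "ExtractFromSalesDb",
    "parameters": {
        # When a Provider is specified, all its Entities are associated with
        # the pipeline automatically (the simplification described above).
        "providerItemId": "<provider item id>",
        "dataSourceItemId": "<data source item id>",
    },
}

response = requests.post(f"{SIDRA_API}/datafactory/pipelines", json=payload,
                         headers=HEADERS, timeout=30)
response.raise_for_status()
print("Pipeline metadata created:", response.json())
```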

Step 4: Prepare or create scripts to create the tables in Databricks that will store the final data

This step involves the auto-generation of the code (not its execution) required to eventually perform the ingestion of the data into the Data Storage Unit (Databricks).

The page Data Ingestion describes the general steps involved in a very typical data ingestion in Sidra: batch file ingestion from the landing zone. Within this flow, it is explained that the actual data ingestion into the DSU consists of two stages: file registration and file ingestion in Databricks.

In the above page, it is explained that, after performing the data extraction from the data source into the Sidra landing zone or similar, the file ingestion process is performed to ingest the data in an optimized way into the Data Storage Unit. File ingestion is the process that reads the raw copy of the file and persists the information in an optimized format in the DSU, after executing some initial optimizations. Data ingestion in Databricks describes the process in more detail.

The file ingestion part has as a prerequisite the generation of some scripts. Two different Spark scripts need to be created sequentially: the Table creation and the Transfer query scripts.

These scripts are created specifically for the Entity of the Asset to be ingested. This means that Assets from the same Entity will share the same scripts.

  • The Table creation script creates the necessary database and tables in the Databricks cluster. More detailed information is included in Table creation.
  • The Transfer query script reads the raw copy of the Asset and inserts the information in the tables created by the previous script in an optimized format. One transfer query script is generated per Entity. More information about this script can be found in Transfer query section.

In this step of preparing the file ingestion structures, the table creation and transfer query scripts are generated and parametrized with the Entity metadata details. Therefore, there will be one transfer query script generated per Entity. For a detailed explanation of the file ingestion step, you can check this other page.
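The real scripts are auto-generated by Sidra for each Entity, so there is normally no need to write them by hand. Purely as a conceptual illustration of what the two scripts do, the sketch below uses invented database, table and column names; it is not the generated code.

```python
# Conceptual illustration only: the real Table creation and Transfer query
# scripts are auto-generated by Sidra per Entity. Names below are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "Table creation" part: create the database and Delta table for the Entity.
spark.sql("CREATE DATABASE IF NOT EXISTS salesdb_provider")
spark.sql("""
    CREATE TABLE IF NOT EXISTS salesdb_provider.customers (
        CustomerId INT,
        Name STRING,
        FileDate DATE
    )
    USING DELTA
    PARTITIONED BY (FileDate)
""")

# "Transfer query" part: read the raw copy of the Asset and insert it,
# after basic validations/optimizations, into the optimized table.
raw_df = (
    spark.read
    .option("header", "true")
    .csv("/mnt/raw/salesdb_provider/customers/2024/01/15/Customers_20240115.csv")
)
raw_df.createOrReplaceTempView("customers_raw")
spark.sql("""
    INSERT INTO salesdb_provider.customers
    SELECT CAST(CustomerId AS INT), Name, current_date() AS FileDate
    FROM customers_raw
""")
```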

Depending on the type of data source, we can distinguish where this Step 4 (preparing the file ingestion structures) is actually implemented:

  • The metadata extraction pipelines available for some database sources (SQL) transparently implement this step of preparing the transfer query scripts for data ingestion in Databricks. Afterwards, it is the data extraction pipeline that executes the transfer query scripts.

  • For the typical flow of file ingestion from the landing zone, the pipeline FileIngestionDatabricks contains a step to generate the transfer query scripts.

In a later step (Step 6), the actual execution of the transfer query scripts will happen from a Databricks activity in ADF.

Step 5: Deploy the data extraction and data ingestion pipeline

Once the different pipelines for data extraction have been created (for example, any of the ExtractFrom{data source name} pipelines), they need to be deployed into Azure Data Factory.

Data extraction pipelines have a prerequisite: all the metadata about the sources must be configured. If a metadata extractor pipeline is available, it needs to be deployed and executed before proceeding with the deployment and execution of the data extractor pipeline (see Step 2 in this document). The Sidra API incorporates different methods for deploying ADF pipelines.

As described before, data extraction pipelines differ depending on the type of the data source.

Deploying an ADF pipeline can take several minutes depending on the status of the environment. Processing cluster resources (e.g. Databricks) are required in order to perform this operation in Sidra Core.
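A deployment request could look roughly like the sketch below; the route is a placeholder for whichever deployment endpoint the Sidra API of your installation exposes.

```python
# Minimal sketch: ask Sidra to deploy a pipeline (defined in the Data Factory
# metadata) into Azure Data Factory. The route is a hypothetical placeholder.
import requests

SIDRA_API = "https://<your-sidra-core>/api"
HEADERS = {"Authorization": "Bearer <token>"}

response = requests.post(
    f"{SIDRA_API}/datafactory/pipelines/<pipelineId>/deploy",
    headers=HEADERS, timeout=30,
)
response.raise_for_status()
# Deployment is asynchronous and can take several minutes; poll the API or
# check ADF to confirm that the pipeline has been created.
```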

Configure the trigger to schedule the data extraction

Sidra incorporates different templates for ADF triggers in order to schedule the execution of the data extractor pipelines.

Triggers are basically representations of the specific schedule for running an ADF pipeline (for example, daily executions at a fixed time).

Sidra provides TriggerTemplates in order to instantiate a new trigger. Sidra Core API provides an endpoint to create a new trigger from the Id of the template.

The purpose of the trigger is to execute an associated pipeline at the desired time. Existing triggers can also be used instead of creating a new trigger.

In any case, a request needs to be made to the Sidra API to associate the data extractor pipeline with the newly created or existing trigger.
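For illustration, creating a daily trigger from a TriggerTemplate and associating it with the data extractor pipeline could look roughly like the sketch below. The routes, field names and recurrence format are assumptions made for this example.

```python
# Minimal sketch: create a trigger from a template and associate it with a pipeline.
# Routes, field names and recurrence format are hypothetical placeholders.
import requests

SIDRA_API = "https://<your-sidra-core>/api"
HEADERS = {"Authorization": "Bearer <token>"}

# 1) Create the trigger from the Id of a TriggerTemplate (e.g. a schedule trigger).
trigger = requests.post(
    f"{SIDRA_API}/datafactory/triggers",
    json={
        "idTriggerTemplate": "<trigger template id>",
        "name": "DailySalesDbTrigger",
        "properties": {"frequency": "Day", "interval": 1, "startTime": "2024-01-01T02:00:00Z"},
    },
    headers=HEADERS, timeout=30,
).json()

# 2) Associate the data extractor pipeline with the new (or an existing) trigger.
requests.post(
    f"{SIDRA_API}/datafactory/triggers/{trigger['id']}/pipelines/<pipelineId>",
    headers=HEADERS, timeout=30,
).raise_for_status()
```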

Step 6: Execute the data extraction and ingestion pipelines

Once the data extraction and ingestion pipelines have been deployed, they need to be executed or run in order to perform the actual data extraction.

As described before, data extraction pipelines differ depending on the type of the data source.

When the actual ingestion process happens, the following actions take place:

  • The file registration: this is the process of creating an Asset in the platform representing the file to be ingested in the DSU.

The result of the registration is the population of the correct data (registration of the Asset) in the Sidra Core intake metadata and control tables. The files are registered using the Sidra API.

  • The file ingestion is the process that reads the raw copy of the file and loads the information into the DSU, after executing some initial optimizations (e.g. error validation, partitioning, etc.).

    The file ingestion sub-step is performed by an Azure Data Factory pipeline that will be selected depending on the configuration of the Entity associated to the Asset.

The file ingestion involves the execution of the transfer query scripts prepared for each Entity in Step 4.

Depending on the type of data source, the extraction and ingestion pipelines differ in the number of steps included. Data extraction pipelines for database sources perform all the steps required to ingest the data into the Data Storage Unit: the actual movement of data from the source, the Asset registration, and the execution of the transfer query scripts that perform the data optimization actions before storing the data in the Data Storage Unit (file ingestion). Therefore, in this case there is no distinction between data extraction and data ingestion pipelines.
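As noted above, file (Asset) registration goes through the Sidra API. For database sources this call is made by the extraction pipeline itself, but conceptually it amounts to something like the sketch below; the route and payload fields are hypothetical placeholders.

```python
# Minimal sketch: register a landed file as an Asset through the Sidra API.
# The route and payload fields are hypothetical placeholders; for database
# sources this registration is performed automatically by the pipeline.
import requests

SIDRA_API = "https://<your-sidra-core>/api"
HEADERS = {"Authorization": "Bearer <token>"}

payload = {
    "idEntity": "<entity id>",
    "assetUri": "https://<dsu-storage>.blob.core.windows.net/raw/salesdb/customers/2024/01/15/Customers_20240115.csv",
    "assetDate": "2024-01-15",
}

response = requests.post(f"{SIDRA_API}/metadata/assets/register", json=payload,
                         headers=HEADERS, timeout=30)
response.raise_for_status()
print("Asset registered with Id:", response.json().get("id"))
```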