How to configure a new data source in Sidra

This section describes the general process of setting up a new data source in Sidra, including references to common concepts and pointers to specific methods for setting up certain types of data sources.

Introduction

Sidra provides several out-of-the-box components, as well as accelerators, for setting up the ingestion of new data sources.

The movement of data in the Sidra platform is orchestrated by Azure Data Factory (ADF). Every object in ADF is defined using a JSON structure that can be used, along with the ADF API and SDKs, to deploy pipelines programmatically.

Data sources to be ingested into Sidra can be of many different types.

The most common types of data sources have out-of-the-box support in Sidra. Among these are the most common database engines (SQL, MySQL, DB2) and file-based ingestion (e.g. from a landing zone).

For those types of sources not supported out of the box, Sidra offers common infrastructure and accelerators to set them up with minimal development and configuration effort. This is possible thanks to Sidra's common metadata framework, used to register and configure the abstractions that define the data to be ingested, as well as its pipeline generation and automation machinery, used to set up the ingestion infrastructure.

The configuration of these data sources can be performed through different mechanisms.

Before delving into these mechanisms, it is important to explain the key concepts and steps involved.

Basic steps and concepts

Setting up a new data source for data ingestion in Sidra involves several important steps:

  • Step 1: Configure the metadata for this data source (e.g. Provider/Entity/Attribute).
  • Step 2: Prepare the file ingestion structures (scripts) for performing the file ingestion into the Data Storage Unit.
  • Step 3: Configure the metadata for the connections and data pipelines to be used for this data source.
  • Step 4: Deploy the data integration and ingestion pipelines.
  • Step 5: Execute the data integration and ingestion pipelines, including the ingestion into the Data Storage Unit.

Depending on the type of data source, some of these steps may be simplified or executed together.

Also depending on the type of data source, Sidra incorporates additional components that abstract some of the details of these steps and wrap them into common API endpoints, or even expose them as a self-service UI in the form of Sidra connectors. See the section on the Connectors wizard for more details on what Sidra connectors are and how they can be used from Sidra Web to set up new data sources.

Also depending on the type of data source, Sidra incorporates already-deployed out-of-the-box pipelines, as is the case for file ingestion from the Landing Zone. In this case, it is not necessary to perform an explicit deployment or manual execution of the pipelines.

Below is a high-level description of what these steps mean and why they are required. The specifics for the different types of data sources are also covered in each step for clarity.

Step 1: Configure the metadata for a data source

Section Sidra metadata model hierarchy contains an explanation of the key pieces that make up the Sidra metadata model, namely DSU, Providers, Entities and Attributes.

Any new data source to be configured in Sidra requires this metadata to be populated as a first step, in order to represent the actual data to be ingested. Sidra metadata is stored in the DataIngestion schema of the Sidra Core database. The metadata contains information about the Entities to be ingested, as well as how these Entities should be handled by the system.

Sidra adds additional system-level attributes to convey system-handling information at the level of the Entity (or table).

These are the different steps required in order to set up the metadata for a new data source in Sidra. Links to tutorial pages are included for each of the steps for a generic data source:

  • Create a new Provider (if required). A new data source will be composed of one or several Entities. Such Entities can be part of an existing Provider (because they are logically related to that Provider), or we can choose to associate these Entities to a new Provider. More details on how to set up a Provider in Sidra are here. For example, a whole database could be considered a Provider or part of a Provider.
  • Create Entity/Entities. A new data source is composed of at least one new Entity. For example, each table in an SQL database will be considered a different Entity. More details on how to configure an Entity in Sidra are here.
  • Create new Attributes. For example, to specify each of the fields of a database table or a CSV file. More details on how to configure an Attribute in Sidra are here.
  • Create an AttributeFormat (if needed): More details on how to configure an AttributeFormat in Sidra are here.
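
Putting these pieces together, the sketch below shows how such metadata objects might be created programmatically through the Sidra API (one of the configuration methods described in the next section). It is a minimal sketch only: the endpoint routes, payload fields, names and authentication shown here are illustrative assumptions, so the tutorial pages linked above should be taken as the authoritative reference.

```python
# Minimal sketch only: routes, payload fields and names below are assumptions, not the
# actual Sidra API contract; see the Provider/Entity/Attribute tutorials for the real one.
import requests

API = "https://<sidra-core-api>"               # placeholder base URL of the Sidra Core API
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder bearer token

# 1. Create (or reuse) a Provider for the new data source.
provider = requests.post(f"{API}/api/metadata/providers", headers=HEADERS, json={
    "providerName": "SalesDatabase",           # hypothetical Provider
    "description": "Sales transactional database",
}).json()

# 2. Create one Entity per table or file to be ingested.
entity = requests.post(f"{API}/api/metadata/entities", headers=HEADERS, json={
    "idProvider": provider["id"],
    "name": "Customers",                       # hypothetical Entity (a source table)
}).json()

# 3. Create one Attribute per column or field of the Entity.
requests.post(f"{API}/api/metadata/attributes", headers=HEADERS, json={
    "idEntity": entity["id"],
    "name": "CustomerId",
    "sqlType": "INT",                          # hypothetical field names
    "isPrimaryKey": True,
})
```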

Methods to configure the metadata for a data source

Depending on the type of data source, Sidra incorporates helper methods and accelerators to perform the above steps and to configure the actual file ingestion into the DSU.

Most generally, there are four main ways to populate the metadata in Sidra Core database:

  • By means of SQL scripts (see details in above tutorial links).
  • By means of API calls (see details in the above tutorial links). This is the recommended option over the SQL scripts method if the data source does not yet include support for a Sidra connector (see next point).
  • By means of Sidra connectors, configured from Sidra Web for the types of data sources that support them (see the section on the Connectors wizard).
  • For specific data sources (SQL databases), Sidra incorporates accelerators such as metadata extraction pipelines: the corresponding pipeline templates create a pipeline associated with the Provider in order to automatically retrieve and infer all the metadata from the data source. Examples of the pipeline templates generating these types of pipelines are: Insert metadata using database schema for TSQL and Insert metadata using database schema for DB2. These types of pipelines can be deployed and executed through API calls. The tutorial How to configure tables to exclude describes how to configure tables to be excluded from the SQL data extraction when using a SQL metadata extraction pipeline.

The above-mentioned metadata extraction pipelines, available for some sources like SQL databases, are used internally in Sidra for two main purposes:

  • Create and populate the Sidra Core metadata structures (Entities and Attributes), from the information of the different tables and columns in the source database system. This would encompass the above actions in Step 1 for creating Entities and Attributes.
  • Prepare for the actual file ingestion into the Data Storage Unit by auto-generating the table creation and transfer query scripts for each Entity, including all the steps for storing the optimized data in the data lake storage. This is described in Step 2 of this document.

Once the metadata extractor pipeline has been created, it needs to be executed. This execution can take a while (from 5 to 15 minutes), depending on the load of Data Factory and the status and load of the associated Databricks cluster. Sidra also exposes API endpoints to manually execute the SQL metadata extractor pipeline.

Step 2: Prepare the file ingestion structures

This step involves the code auto-generation (not its execution) required in order to eventually execute the ingestion of the data into the Data Storage Unit (Databricks).

The page Data Ingestion describes the general steps involved in a very typical data ingestion in Sidra: file batch ingestion from the landing zone. Within this flow, the actual data ingestion into the DSU comprises two stages: file registration and file ingestion in Databricks.

That page explains that, after the actual data extraction from the data source into the Sidra landing zone (or a similar location), the file ingestion process ingests the data in an optimized way into the Data Storage Unit. The file ingestion is the process that reads the raw copy of the file and persists the information in an optimized format in the DSU, after executing some initial optimizations. Data ingestion in Databricks describes the process in more detail.

The file ingestion part requires, as a prerequisite, the generation of some scripts. Two different Spark scripts need to be created sequentially: the Table creation and Transfer query scripts.

These scripts are created specifically for the Entity of the Asset to be ingested. This means that Assets from the same Entity will share the same scripts.

  • The Table creation script creates the necessary database and tables in the Databricks cluster. More detailed information is included in Table creation.
  • The Transfer query script reads the raw copy of the Asset and inserts the information in the tables created by the previous script in an optimized format. One transfer query script is generated per Entity. More information about this script can be found in Transfer query section.

In this step of preparing the file ingestion structures, the table creation and transfer query scripts are created, parametrized with the Entity metadata details. There will therefore be one transfer query script generated per Entity.
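
Conceptually, the two generated scripts boil down to something like the PySpark sketch below. This is only an illustration of the idea, not the generated code: the real scripts are produced automatically from the Entity metadata and include additional system-level attributes, validations and optimizations, and the database, table, column and path names used here are hypothetical.

```python
# Conceptual sketch only: the real Table creation and Transfer query scripts are
# auto-generated from the Entity metadata; all names and paths here are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# "Table creation": create the database and the Entity table once per Entity.
spark.sql("CREATE DATABASE IF NOT EXISTS salesdatabase")
spark.sql("""
    CREATE TABLE IF NOT EXISTS salesdatabase.customers (
        CustomerId INT,
        Name STRING,
        SidraIdAsset INT,        -- placeholder for a system-level attribute
        SidraLoadDate DATE       -- placeholder for a system-level attribute
    ) USING DELTA
""")

# "Transfer query": read the raw copy of the Asset and persist it in optimized format.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("/mnt/landing/salesdatabase/customers/customers.csv"))

(raw
 .withColumn("SidraIdAsset", F.lit(1234))       # id of the Asset being ingested
 .withColumn("SidraLoadDate", F.current_date())
 .write.format("delta").mode("append").saveAsTable("salesdatabase.customers"))
```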

Depending on the type of data source, we can distinguish where this step 2 (prepare the file ingestion structures) is actually being implemented:

  • The metadata extraction pipelines available for some database sources (SQL) transparently implement this step of preparing the transfer query scripts for data ingestion in Databricks. Afterwards, it is the actual data extraction pipeline that executes the transfer query scripts.

  • For the typical flow of file ingestion from the landing zone, the pipeline FileIngestionDatabricks contains a step to generate the transfer query scripts.

In a later step (step 5), the actual execution of the transfer query scripts will happen from a Databricks ADF activity.

Step 3: Configure the connections and data pipelines to be used

Once all the metadata regarding a set of Assets from a data source has been configured (Provider, Entities, Attributes) in step 1, the next step is to populate the metadata database with the information of the Azure Data Factory pipelines.

Sidra Core stores the information that defines a pipeline in the Core database, using a series of pipeline templates. These ADF component templates are the JSON structures that define each component, but contain placeholders. When the placeholders are resolved with actual values, the resulting JSON structure can be used to create the Data Factory components in ADF.

Sidra data platform takes advantage of this capability and provides a metadata system for storing those JSON structures and a component to automatically deploy those pipelines in ADF:

  • The metadata system is composed of a set of tables in the metadata database that are generically called Data Factory metadata, located in the DataIngestion schema of the metadata database. Section Data Factory tables describes these tables in more detail.
  • Data Factory Manager is the internal Sidra component (the user does not need to configure anything for this) that uses the Data Factory metadata and programmatically deploys the ADF objects defined in it.

This system automates the deployment of pipelines so they can be easily set up in any environment (for example, to deploy to a new test environment) at any given time. The idea is not only to reproduce the environment configuration in Azure Data Factory, but also to be able to add new pipelines by adding information to the metadata.

The Data Factory metadata is organized using templates for each of the ADF components (activities, datasets, triggers, pipelines). This facilitates the creation of new pipelines by reusing the templates already created.
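
As a purely illustrative example of the template-plus-placeholder mechanism, the sketch below resolves a placeholder in a made-up pipeline template and pushes the result to ADF using the public Azure Data Factory Python SDK. In Sidra this is what Data Factory Manager automates from the Data Factory metadata, so no manual step like this is required; all resource names and the template content are placeholders.

```python
# Illustration of the template + placeholder idea; Sidra's Data Factory Manager performs
# the equivalent of this automatically. The template content and all names are made up.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import PipelineResource, WaitActivity

# A trivial "pipeline template" as it could be stored in the Data Factory metadata:
# a JSON structure containing placeholders.
template = '{"pipelineName": "LoadFrom#{ProviderName}#", "waitSeconds": 5}'

# Resolve the placeholders with the actual values kept in the metadata database.
resolved = json.loads(template.replace("#{ProviderName}#", "SalesDatabase"))

# Deploy the resulting component to ADF programmatically.
client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
client.pipelines.create_or_update(
    "<resource-group>",
    "<data-factory-name>",
    resolved["pipelineName"],
    PipelineResource(
        activities=[WaitActivity(name="PlaceholderActivity",
                                 wait_time_in_seconds=resolved["waitSeconds"])]
    ),
)
```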

Section Configure new pipelines describes the step-by-step process to configure a new pipeline based on the templates provided by Sidra.

Configure the data source

One of the key elements that needs to be created and deployed in ADF, besides pipelines, is the Linked Service.

Linked Services in Sidra are abstracted under the term data source. A data source in Sidra represents the connection to the source data system.

A data source abstracts the details of creating a Linked Service in Azure Data Factory. Data sources behave like connection strings, allowing Data Factory to connect to different source systems: SQL servers, REST endpoints, Blob folders, etc.

Usually, the fields required to set up a data source are the Integration Runtime and, in the case of a database, the connection string.

For configuring a new data source, Sidra requires registering it in the Sidra metadata, which in turn deploys a Linked Service in Azure Data Factory with this connection. The created Linked Service will use the Sidra Core Key Vault to store the connection string to the database.

Sidra provides several data source templates for common sources of data, e.g. "AzureSql", "HttpServer", etc.

For more details on Linked Services check the Data Factory documentation.

Sidra provides REST endpoints to register a new data source from a template, as well as to automatically deploy the corresponding infrastructure in Azure Data Factory for the pipelines to use. This is the recommended method for registering a data source in Sidra and thereby deploying a Linked Service in ADF. For some types of data sources, like SQL databases, there is an even more automated and user-friendly way: using connectors from Sidra Web. For more details on the manual process of creating a new Linked Service from a template, you can visit the page Connecting to new data sources through linked services.
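
For illustration, the sketch below shows what the deployed artifact amounts to in ADF terms: a Linked Service for an Azure SQL database whose connection string is resolved from Key Vault, created here with the public ADF Python SDK. In practice the Sidra REST endpoint (or a connector) creates this for you, so treat the resource, secret and Linked Service names as placeholders.

```python
# Illustration only: Sidra's data source registration deploys the equivalent of this
# Linked Service automatically. All resource, secret and service names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService,
    LinkedServiceReference,
    LinkedServiceResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.linked_services.create_or_update(
    "<resource-group>",
    "<data-factory-name>",
    "SalesDatabaseLinkedService",
    LinkedServiceResource(
        properties=AzureSqlDatabaseLinkedService(
            # The connection string itself is kept as a secret in the Core Key Vault.
            connection_string=AzureKeyVaultSecretReference(
                store=LinkedServiceReference(
                    reference_name="CoreKeyVaultLinkedService",
                    type="LinkedServiceReference",
                ),
                secret_name="SalesDatabaseConnectionString",
            )
        )
    ),
)
```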

Configure the data extraction and ingestion pipeline

Sidra incorporates different out-of-the-box pipelines and pipeline templates to orchestrate data extraction from source systems.

Data extraction pipeline templates create pipelines associated with the Provider in order to orchestrate the data extraction from the source.

  • For the first step of actually extracting the data from the source system, Sidra comes with several templates for data extraction pipelines. The data extraction pipelines built through ADF pipelines in Sidra follow a naming convention: LoadFrom{data source name}. These involve the actual extraction or movement of data from the source system to the landing zone in Sidra storage. Sidra provides templates to create extraction pipelines for the most common scenarios.

  • For the step of ingesting the data into the DSU, Sidra also offers data ingestion pipelines. Data ingestion pipelines execution includes the process of actually ingesting the data from the landing zone into the DSU (e.g. Databricks) in its optimized format. See File ingestion. The generation of the ingestion processing logic, namely the transfer query scripts, was done in Step 2 above.

The data extraction and ingestion pipelines to use vary depending on the type of data source:

  • Some types of data sources, like SQL, come with data extractor pipeline templates, like the BatchExtractAndIntakeSQL. Pipelines created out of this template perform transparently all actions from the actual extraction or movement of data from the source system, to the execution of the actual file ingestion into the Data Storage Unit. Steps 4 and 5 in this document describe how to deploy and execute these pipelines.

  • For file ingestion from the Landing Zone, it is not necessary to explicitly deploy any pipeline, as the required pipelines come pre-installed with Sidra. Two pipelines are used in this case: RegisterAsset and FileIngestionDatabricks. See the tutorial Add Asset for more information:

    • RegisterAsset performs the steps of adding the Asset in Sidra from the API, moving the data to the right container from the landing zone, and calling the FileIngestionDatabricks pipeline. See details on Asset flow into the platform. This pipeline will be triggered and executed automatically whenever there is a new file uploaded to the landing zone.

    • FileIngestionDatabricks is called from the RegisterAsset pipeline and it performs the steps of preparing the transfer query script (step 2), as well as executing the transfer query for the file ingestion.

    These pipelines do not include the actual data extraction step (as they start from a file already in the Landing Zone), but focus on the file registration and the file ingestion in Databricks. These pipelines can in turn be used directly in Sidra or be used by other pipelines.

In any case, the specific pipeline to deploy and run is selected based on the configuration of the Entity to which the Asset is associated. There is an Entity-pipeline association that needs to be created in order to associate an Entity with a data extraction pipeline. In the case of the pipeline template BatchExtractAndIntakeSQL there is an additional simplification: if a Provider is specified, all the Entities from that Provider are automatically associated to the pipeline.
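
As a hedged sketch of that association (the route and identifiers are assumptions rather than the documented Sidra API contract), linking an Entity to a deployed extraction pipeline could look like this:

```python
# Hypothetical sketch: the Entity-pipeline association endpoint and identifiers shown
# here are assumptions; check the Sidra API reference for the actual EntityPipeline route.
import requests

API = "https://<sidra-core-api>"               # placeholder base URL of the Sidra Core API
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder bearer token

entity_id = 101                                        # hypothetical Entity id
pipeline_id = "11111111-2222-3333-4444-555555555555"   # hypothetical pipeline id

requests.post(
    f"{API}/api/metadata/entities/{entity_id}/pipelines/{pipeline_id}",
    headers=HEADERS,
)
```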

Step 4: Deploy the data extraction and data ingestion pipeline

Once the different pipelines for data extraction have been created (for example, any of the LoadFrom{data source name} pipelines), they need to be deployed into Azure Data Factory.

Data extraction pipelines have, as a prerequisite, that all the metadata about the sources has been configured. If a metadata extractor pipeline is available, it needs to be deployed and executed before proceeding with the deployment and execution of the data extractor pipeline (see Step 1 in this document). The Sidra API incorporates different methods for deploying ADF pipelines.

As described before, data extraction pipelines differ depending on the type of the data source.

Deploying an ADF pipeline can take several minutes depending on the status of the environment. Processing cluster resources (e.g. Databricks) are required in order to perform this operation in Sidra Core.

Step 5: Execute the data extraction and ingestion pipelines

Once the data extraction and ingestion pipelines have been deployed, they need to be executed in order to perform the actual data extraction and ingestion.

As described before, data extraction pipelines differ depending on the type of the data source.
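
Whatever the source type, the deployed pipelines ultimately run as ADF pipeline runs, which Sidra exposes through its own API endpoints and which can also be launched and monitored with the public ADF Python SDK, as in the hedged sketch below (pipeline and resource names are placeholders).

```python
# Launching a deployed extraction pipeline and polling its run status through the public
# ADF SDK. Sidra also exposes its own API endpoints for this; all names are placeholders.
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Start a run of the deployed extraction pipeline.
run = client.pipelines.create_run(
    "<resource-group>", "<data-factory-name>", "LoadFromSalesDatabase"
)

# Poll until the run leaves the in-progress states.
status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = client.pipeline_runs.get(
        "<resource-group>", "<data-factory-name>", run.run_id
    ).status

print(f"Pipeline run finished with status: {status}")
```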

When the actual ingestion process happens the following actions take place:

  • The file registration: this is the process of creating an Asset in the platform representing the file to be ingested in the DSU.

    The result of the registration is the population of the correct data (registration of the Asset) in the Sidra Core intake metadata and control tables. The files are registered using the Sidra API.

  • The file ingestion is the process that reads the raw copy of the file and intakes the information in the DSU, after executing some initial optimizations (e.g. error validation, partitioning, etc.).

    The file ingestion sub-step is performed by an Azure Data Factory pipeline that will be selected depending on the configuration of the Entity associated to the Asset.

The file ingestion involves the execution of the transfer query scripts prepared for each Entity in step 2.

Depending on the type of data source, the extraction and ingestion pipelines differ in the number of steps included:

  • The data extraction pipeline for SQL sources performs all the steps required to ingest the data into the Data Storage Unit, from the actual movement of data from the source, to the Asset registration and the actual execution of the transfer query scripts that perform the data optimization actions before storing the data in the Data Storage Unit (file ingestion). Therefore, in this case there is no distinction between data extraction and data ingestion pipelines.
  • As mentioned above, for file-based ingestion from the landing zone, the process is a bit different. As the actual data extraction has already happened (the data is already in the landing zone), the process is completed through just the data ingestion pipelines. Two separate pipelines are involved in the process of Asset registration and file ingestion in Databricks: RegisterAsset and FileIngestionDatabricks, respectively.

Configure the trigger to schedule the data extraction

Sidra incorporates different templates for ADF triggers in order to schedule the execution of the data extractor pipelines.

Triggers are representations of the specific schedule for running an ADF pipeline (for example, daily executions at a fixed time).

Sidra provides TriggerTemplates in order to instantiate a new trigger. Sidra Core API provides an endpoint to create a new trigger from the Id of the template.

The purpose of the trigger is to execute an associated pipeline at the desired time. Existing triggers can also be used instead of creating a new trigger.

In any case, a request needs to be made to the Sidra API to associate the data extractor pipeline with the newly created or existing trigger.
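
As an illustration of what the resulting ADF artifact looks like, the sketch below creates a daily schedule trigger and associates it with an extraction pipeline using the public ADF Python SDK. In Sidra this is done through the API from a TriggerTemplate, so the names and schedule used here are placeholders only.

```python
# Illustration only: in Sidra the trigger is created from a TriggerTemplate through the
# API and associated with the pipeline via another API request. Names are placeholders.
from datetime import datetime

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    RecurrenceFrequency,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.triggers.create_or_update(
    "<resource-group>",
    "<data-factory-name>",
    "DailyLoadFromSalesDatabase",
    TriggerResource(
        properties=ScheduleTrigger(
            # Daily execution at a fixed time.
            recurrence=ScheduleTriggerRecurrence(
                frequency=RecurrenceFrequency.DAY,
                interval=1,
                start_time=datetime(2024, 1, 1, 3, 0),
                time_zone="UTC",
            ),
            # Associate the trigger with the data extraction pipeline.
            pipelines=[
                TriggerPipelineReference(
                    pipeline_reference=PipelineReference(
                        reference_name="LoadFromSalesDatabase",
                        type="PipelineReference",
                    )
                )
            ],
        )
    ),
)
# Note: in ADF, a newly created trigger still has to be started before it fires.
```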