Sidra Data Intake mechanisms
Before continuing with the specific steps for a Data Intake, it is a prerequisite to understand the key concepts described in the Sidra Metadata section.
The processes by which data is ingested from the source systems to its final storage in the data lake are referred to as Data Intake Processes in Sidra. Data Intake Processes (DIPs) are set up through different mechanisms, which range from a UI configuration wizard to a series of Sidra API calls. A Data Intake, by contrast, covers only part of the ingestion process and is not an end-to-end process fully configurable via the UI, as a DIP is.
Key features of DIPs and Data Intakes
Some key features of a Data Intake Process are:
A Data Intake Process is mapped to exactly one data source (there cannot be more than one data source per Data Intake Process).
A Provider (see Assets metadata section) is defined in Sidra as a logical container of different Entities, which can come from different data sources. In theory, this means that several Data Intake Processes could be associated with the same Provider in the Sidra Metadata system. However, as noted in the Sidra connectors overview, it is currently not possible to create a Data Intake Process and assign it to an existing Provider: creating a Data Intake Process today requires creating a new Provider as well. The Sidra roadmap includes a feature to create a Data Intake Process for an existing Provider.
Sidra supports several out of the box components as well as accelerators for setting up the ingestion of new data sources.
The movement of data in the Sidra platform is orchestrated by Azure Data Factory (ADF). Every object in ADF is defined using a JSON structure, which can be used along with the ADF API and SDKs to deploy pipelines programmatically.
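To make the idea of JSON-defined ADF objects concrete, the following is a minimal sketch that builds a pipeline definition with a single Copy activity as a Python dictionary and serializes it to JSON. The pipeline, activity, and dataset names are hypothetical and are not taken from Sidra; the source and sink types are assumptions chosen for illustration. Deploying the definition through the ADF API or SDK is deliberately left out.

```python
import json


def build_copy_pipeline(name: str, source_dataset: str, sink_dataset: str) -> dict:
    """Return an ADF-style pipeline definition containing one Copy activity.

    All names passed in are illustrative; they are not Sidra-generated names.
    """
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": "CopySourceToRaw",  # hypothetical activity name
                    "type": "Copy",
                    "inputs": [
                        {"referenceName": source_dataset, "type": "DatasetReference"}
                    ],
                    "outputs": [
                        {"referenceName": sink_dataset, "type": "DatasetReference"}
                    ],
                    "typeProperties": {
                        # Assumed source/sink types for a SQL-to-Parquet copy.
                        "source": {"type": "AzureSqlSource"},
                        "sink": {"type": "ParquetSink"},
                    },
                }
            ]
        },
    }


# Build the definition and serialize it to the JSON form that ADF works with.
pipeline = build_copy_pipeline("ExtractSalesDb", "SalesDbTable", "RawStorageParquet")
pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

Keeping the definition as plain data makes it easy to generate one pipeline per data source from metadata, which is the kind of automation the text describes.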
Data sources to be ingested into Sidra can be of different types:
- The most common types of data sources have out-of-the-box support in Sidra. These include the most common database engines (SQL, MySQL, DB2) and file-based ingestion (e.g. from a landing zone).
- For those types of sources not supported out of the box, Sidra offers common infrastructure and accelerators for setting them up with minimal development and configuration effort. This is possible thanks to Sidra's support of a common metadata framework to register and configure the abstractions to define the data to be ingested, as well as the pipeline generation and automation machinery to set up the ingestion infrastructure.
The configuration of these data sources can be performed by different mechanisms. On the implementation side, Sidra incorporates different artifacts and accelerators:
For Data Intakes, for example, Sidra has an out-of-the-box process for ingesting data from the so-called landing zone containers, which are basic storage containers where the sets of data extracted from the source systems are deposited based on an agreement. This process can be complemented with separate data extraction pipelines that only cover the extraction of the data from the source systems (e.g., services or APIs) and the deposit of this data in the landing zone.
For database types of sources (DIPs), however, Sidra incorporates full end-to-end data extraction and ingestion (extract and load) pipelines, which include the movement of the data from the database source systems all the way to its final storage in an optimized format in the data lake.
Data intake generic process for end-to-end extract and load pipelines
This is a Data Intake Process configured for database-type data sources, such as Azure SQL, SQL Server or DB2.
The generic steps for configuring a Data Intake Process are the following:
- Step 1. Configure and create the Data Source. This creates an underlying ADF Linked Service used in ADF to actually connect to the source system.
- Step 2. Define and configure the Asset metadata for the Data Intake Process (e.g., Provider/Entity/Attribute).
- Step 3. Define and create the data extraction pipeline to actually extract data from the source system at defined scheduled intervals.
- Step 4. Prepare or create scripts to create the tables in Databricks that will store the final data (Table creation and legacy autogenerated transfer query / DSU ingestion scripts).
- Step 5. Deploy the data extraction pipeline and associate a trigger.
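Steps 2 and 4 above can be sketched together: the Asset metadata (Provider/Entity/Attribute) is modeled as plain data, and a table creation script is derived from it. This is only an illustration under stated assumptions. The class fields, the `create_table_script` helper, and the use of a Delta table are hypothetical and do not reproduce Sidra's actual metadata schema or its generated scripts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    name: str
    sql_type: str  # Spark SQL type, e.g. "INT" or "STRING"


@dataclass
class Entity:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Provider:
    name: str
    database_name: str  # assumed name of the target database in Databricks
    entities: List[Entity] = field(default_factory=list)


def create_table_script(provider: Provider, entity: Entity) -> str:
    """Build a Spark SQL CREATE TABLE statement from the metadata (hypothetical helper)."""
    columns = ", ".join(f"`{a.name}` {a.sql_type}" for a in entity.attributes)
    return (
        f"CREATE TABLE IF NOT EXISTS `{provider.database_name}`.`{entity.name}` "
        f"({columns}) USING DELTA"
    )


# Example metadata for one Provider with a single Entity.
customer = Entity("customer", [Attribute("id", "INT"), Attribute("name", "STRING")])
provider = Provider("SalesDb", "salesdb", [customer])
script = create_table_script(provider, customer)
print(script)
```

Deriving the script from the metadata, rather than writing it by hand, keeps the table definition in step with the registered Attributes.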
Configure a new data intake
- For a detailed explanation of each of these configuration steps in Data Intake via landing zone, see this page.
- For a detailed explanation of each of these configuration steps in Data Intake with document indexing, see this page.
- For a detailed explanation of each of these configuration steps in Data Intake Processes via connector plugins, see this page.
After these configuration steps have been completed, the new Data Intake Process is fully set up.
Depending on the type of data source, some of these steps may be simplified or executed together. Also depending on the type of data source, Sidra incorporates some additional components that abstract some of the details of these steps and wrap them into a set of self-service UI wizard steps. This is possible thanks to the plugin approach described in the Sidra connectors section.
See the section on the Connectors wizard for more details on what Sidra connectors are and how they can be used from Sidra Web to set up new Data Intake Processes.
The following steps involve the actual periodic execution of the configured extract and load process:
- Step 6. Execution of the data intake (Extract and Load) process.
- Sub-step 6.1. Extract data from source: the data is extracted directly from the data source (e.g. SQL Database) to the container that hosts the raw format files in the Data Storage Unit (DSU). In this case the files do not go through the landing zone.
- Sub-step 6.2. File (Asset) registration. This is the process of creating an Asset in the platform representing the file to be ingested in the DSU. The result of the registration is the population of the correct data (registration of the Asset) in the Sidra Core intake metadata and control tables. The files are registered using the Sidra API.
- Sub-step 6.3. File ingestion. This is the process that reads the raw copy of the file and ingests the information into the DSU, after executing some initial optimizations (e.g. error validation, partitioning, etc.).
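The three sub-steps can be sketched as a small local simulation: extract to a raw file, register an Asset for that file, then ingest it with a basic validation. Everything here is a hypothetical stand-in. `MetadataStore` replaces the Sidra Core metadata tables and the Sidra API, a temporary directory replaces the raw container in the DSU, and the validation rule (required columns must be non-empty) is an assumption chosen for illustration.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class Asset:
    asset_id: int
    entity: str
    raw_path: Path
    status: str = "registered"


class MetadataStore:
    """Hypothetical in-memory stand-in for the intake metadata and control tables."""

    def __init__(self) -> None:
        self.assets: Dict[int, Asset] = {}

    def register(self, entity: str, raw_path: Path) -> Asset:
        asset = Asset(len(self.assets) + 1, entity, raw_path)
        self.assets[asset.asset_id] = asset
        return asset


def extract(rows: List[dict], entity: str, raw_dir: Path) -> Path:
    """Sub-step 6.1: write the source rows straight to the raw storage location."""
    raw_path = raw_dir / f"{entity}.csv"
    with raw_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return raw_path


def ingest(asset: Asset, required: List[str]) -> List[dict]:
    """Sub-step 6.3: read the raw file and keep rows whose required columns are non-empty."""
    with asset.raw_path.open(newline="") as f:
        records = list(csv.DictReader(f))
    valid = [r for r in records if all(r.get(col) for col in required)]
    asset.status = "ingested"
    return valid


def run_intake(rows: List[dict], entity: str, required: List[str]) -> Tuple[Asset, List[dict]]:
    store = MetadataStore()
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = extract(rows, entity, Path(tmp))   # 6.1 extract
        asset = store.register(entity, raw_path)      # 6.2 register the Asset
        loaded = ingest(asset, required)              # 6.3 ingest
    return asset, loaded


source_rows = [{"id": "1", "name": "Ada"}, {"id": "2", "name": ""}]
asset, loaded = run_intake(source_rows, "customer", ["id", "name"])
print(asset.status, len(loaded))
```

Registering the Asset before ingestion means each raw file has a tracked record whose status can be updated as it moves through the process.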
Depending on the type of data source, some of the above configuration steps may be simplified or executed together. Also depending on the type of data source, Sidra incorporates some additional components that abstract some of the details of these steps and wrap them into a set of self-service UI wizard steps. This is possible thanks to the plugin approach described in the Sidra connector plugins section.