Sidra Data Intake Process overview¶
The Sidra Data Platform Overview describes how Sidra is an end-to-end data platform whose key approach to integrating with source systems and bringing data into the platform domain is a data-lake approach. Bringing data into the platform domain is necessary to access data that usually sits in silos in operational systems, most often on-premises, and that needs to be made available for analytics consumption.
The data lake is just one of the first steps in the overall architecture enabling the fast setup of data products and applications based on analytical data. The data lake is used to standardize the Data Intake Process of different data sources and their mapping to the Sidra Metadata system. Thanks to this standardization, it is possible to carry out data governance use cases, such as security and granular access control, and to define and enforce data integration standards.
You can find more information about the Sidra metadata model related to data ingestion on this page.
Azure Data Lake Storage Gen2 (ADLS Gen2) is the service where all the data for every data Provider is added to the system, making it available for downstream consumption.
In contrast to traditional Data Warehouses, data lakes store information in the purest, most raw format possible (the concept of the immutable data lake), whether it is structured or unstructured data. This simplifies the data ingestion logic, shifting the paradigm from ETL (extract-transform-load) to ELT (extract-load-transform), and allows the focus to be on the usage of this data by each Client Application.
The next key piece in Sidra for the end-to-end platform is the concept of Sidra Client Applications. Client Applications are the pieces of the Sidra Data Platform that enable business cases. These data business cases encompass the specific business transformation logic (application-specific data transformations and validations) and a data interface serving final use cases (e.g., reporting) or exposing data to other consuming external applications in the enterprise.
As a key concept of the Sidra metadata system, it is worth explaining here what an Asset in Sidra is. An Asset in the Sidra Data Platform represents an instance of each of the data elements that get ingested into the platform. An Asset is a term that abstracts many different data formats.
While Entities are the metadata structures inside the Sidra Metadata that represent the structure of the tables to be ingested, Assets are the specific instances of data ingested into the platform (data drops). The key components in Sidra Data Platform have been designed to identify, support, manipulate, move and query Assets in the platform.
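The relationship between Providers, Entities, Attributes and Assets can be sketched in code. The following is a deliberately simplified, hypothetical model for illustration only; the real schema lives in the Sidra Core metadata database and is richer than this.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified sketch of the Sidra metadata hierarchy.
# Field names are illustrative, not the actual Sidra schema.

@dataclass
class Attribute:
    name: str
    sql_type: str          # e.g. "NVARCHAR(100)"

@dataclass
class Entity:
    name: str              # represents the structure of a table to ingest
    attributes: List[Attribute] = field(default_factory=list)

@dataclass
class Asset:
    entity_name: str       # the Entity this data drop conforms to
    file_uri: str          # location of the raw file in the DSU

@dataclass
class Provider:
    name: str              # logical container for one or more intake processes
    entities: List[Entity] = field(default_factory=list)

# A Provider groups Entities (table structures); each ingestion run
# produces Assets (concrete data drops) that reference an Entity.
crm = Provider(name="CRM")
customers = Entity(
    name="Customers",
    attributes=[Attribute("CustomerId", "INT"),
                Attribute("Name", "NVARCHAR(100)")],
)
crm.entities.append(customers)
drop = Asset(entity_name="Customers",
             file_uri="raw/crm/customers/2024/01/01/data.parquet")
```

The key point the sketch captures is cardinality: one Provider holds many Entities, and one Entity accumulates many Assets over time, one per data drop.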
Overview of a Data Intake Process configuration¶
The processes by which data is ingested from the source systems to its final storage in the data lake are referred to as Data Intake Processes in Sidra. Data Intake Processes in Sidra are set up through different mechanisms, which range from a UI configuration wizard to making a series of Sidra API calls. In order to configure a new Data Intake Process, several generic steps need to happen. This page introduces these key generic steps for configuring and creating a new Data Intake Process in Sidra, regardless of the specific implementation.
Some key features of a Data Intake Process are:
A Data Intake Process is mapped to exactly one data source (there cannot be more than one data source per Data Intake Process). A Provider (see the Assets metadata section) is just a logical container of different Data Intake Processes, so several Data Intake Processes can be associated with the same Provider in the Sidra Metadata system.
Sidra supports several out-of-the-box components as well as accelerators for setting up the ingestion of new data sources.
The movement of data in the Sidra platform is orchestrated by Azure Data Factory (ADF). Every object in ADF is defined using a JSON structure that can be used along with the API and SDKs provided by ADF to deploy pipelines programmatically.
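To make the JSON structure concrete, the following sketch builds a minimal ADF pipeline definition as a Python dictionary. The pipeline, dataset and parameter names are invented for illustration; a real deployment would send a document of this shape through the ADF REST API or SDK rather than just serializing it.

```python
import json

# Minimal sketch of the JSON structure ADF uses to define a pipeline.
# All names here are hypothetical; only the overall shape (a "properties"
# object holding "activities") reflects how ADF objects are defined.
pipeline_definition = {
    "name": "ExtractCustomersPipeline",          # hypothetical pipeline name
    "properties": {
        "activities": [
            {
                "name": "CopyFromSqlToDataLake",
                "type": "Copy",                  # a built-in ADF activity type
                "inputs": [{"referenceName": "SourceSqlDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "RawParquetDataset",
                             "type": "DatasetReference"}],
            }
        ],
        "parameters": {"windowStart": {"type": "String"}},
    },
}

# This is the payload a programmatic deployment would submit.
payload = json.dumps(pipeline_definition, indent=2)
```

Because every ADF object is "just JSON", Sidra can generate and deploy these definitions automatically as part of pipeline creation.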
Data sources to be ingested into Sidra can be of different types:
- The most common types of data sources have out-of-the-box support in Sidra. Among these are the most common database engines (SQL, MySQL, DB2) and file-based ingestion (e.g., from a landing zone).
- For those types of sources not supported out of the box, Sidra offers common infrastructure and accelerators for setting them up with minimal development and configuration effort. This is possible thanks to Sidra's support of a common metadata framework to register and configure the abstractions that define the data to be ingested, as well as the pipeline generation and automation machinery to set up the ingestion infrastructure.
The configuration of these data sources can be performed by different mechanisms. On the implementation side, Sidra incorporates different artifacts and accelerators:
For example, Sidra has an out-of-the-box process for ingesting data from the so-called landing zone containers, which are basic storage containers where the sets of extracted data from the source systems are deposited based on an agreement. This process can be complemented with separate data extraction pipelines that only cover the extraction of the data from the source systems (e.g., services or APIs) and the deposit of this data in the landing zone.
For database types of sources, however, Sidra incorporates full end-to-end data extraction and ingestion (extract and load) pipelines, which cover the movement of the data from the database source systems all the way to its final storage in optimized format in the data lake.
Data intake generic process for end-to-end extract and load pipelines¶
This is the type of Data Intake Process configured for data sources that are of type database, e.g. Azure SQL, SQL Server, DB2, etc.
The generic steps for configuring a Data Intake Process are the following:
- Step 1. Configure and create the Data Source. This creates an underlying ADF Linked Service used in ADF to actually connect to the source system.
- Step 2. Define and configure the Asset metadata for the Data Intake Process (e.g., Provider/Entity/Attribute).
- Step 3. Define and create the data extraction pipeline that extracts data from the source system at scheduled intervals.
- Step 4. Prepare or create scripts to create the tables in Databricks that will store the final data (Table creation and Transfer Query scripts).
- Step 5. Deploy the data extraction pipeline and associate a trigger.
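The five configuration steps above can be sketched as a sequence of API calls. Everything here is hypothetical: the endpoint paths and payload fields are invented to show the flow, not the actual Sidra API surface, and the sketch only builds the request plan instead of sending it.

```python
# Hedged sketch: the five configuration steps expressed as REST calls
# against a hypothetical API. Paths and payloads are illustrative only.

def plan_data_intake_setup(provider: str, entity: str, conn: str):
    return [
        # Step 1: register the data source -> creates an ADF Linked Service
        ("POST", "/api/datasources",
         {"provider": provider, "connection": conn}),
        # Step 2: define the Asset metadata (Provider/Entity/Attribute)
        ("POST", f"/api/providers/{provider}/entities",
         {"name": entity}),
        # Step 3: define the data extraction pipeline
        ("POST", "/api/pipelines",
         {"name": f"Extract{entity}", "provider": provider}),
        # Step 4: table creation and Transfer Query scripts for Databricks
        ("POST", "/api/scripts",
         {"entity": entity, "kinds": ["CreateTable", "TransferQuery"]}),
        # Step 5: deploy the pipeline and attach a schedule trigger
        ("POST", f"/api/pipelines/Extract{entity}/deploy",
         {"trigger": {"type": "Schedule", "frequency": "Day"}}),
    ]

steps = plan_data_intake_setup("CRM", "Customers", "hypothetical-connection")
```

The ordering matters: metadata (Step 2) must exist before pipeline generation (Step 3), because Sidra generates the pipeline and scripts from the registered Entities and Attributes.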
For a detailed explanation of each of these data intake configuration steps, please continue in this page.
After these configuration steps have been completed, a new Data Intake Process will be in place.
Depending on the type of data source some of these steps may be simplified or executed together. Also depending on the type of data source, Sidra incorporates some additional components that abstract some of the details of these steps and wrap them into a set of self-service UI interface wizard steps. This is thanks to the plugin approach described in Sidra connectors section.
See section on Connectors wizard for more details on what Sidra connectors are and how they can be used from Sidra Web to set up new Data Intake Processes.
The following steps involve the actual periodic execution of the configured extract and load process:
- Step 6. Execution of the data intake (Extract and Load) process.
- Sub-step 6.1. Extract data from source: the data is extracted directly from the data source (e.g. SQL Database), to the container that hosts the raw format files in the Data Storage Unit. In this case the files do not go through the landing zone.
- Sub-step 6.2. File (Asset) registration.
- Sub-step 6.3. File ingestion.
- File registration. This is the process of creating an Asset in the platform representing the file to be ingested in the DSU. The result of the registration is the population of the correct data (registration of the Asset) in the Sidra Core intake metadata and control tables. The files are registered using the Sidra API.
- File ingestion. This is the process that reads the raw copy of the file and intakes the information in the DSU, after executing some initial optimizations (e.g. error validation, partitioning, etc.). For a detailed explanation on the file ingestion step, you can check this other page.
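One execution of the extract-and-load process (Step 6 and its sub-steps) can be summarized in a short sketch. Every function below is a hypothetical stand-in for work actually performed by the ADF pipeline, the Sidra API and the DSU ingestion engine.

```python
# Illustrative walk-through of one run of Step 6; all functions are
# invented placeholders, not Sidra code.

def extract_to_raw(entity: str) -> str:
    """Sub-step 6.1: copy source data straight to the raw container
    in the Data Storage Unit (bypassing the landing zone)."""
    return f"raw/{entity.lower()}/2024/01/01/data.parquet"  # fabricated path

def register_asset(file_uri: str) -> dict:
    """Sub-step 6.2: register the file as an Asset via the Sidra API,
    populating the Core intake metadata and control tables."""
    return {"assetId": 1, "uri": file_uri, "status": "Registered"}

def ingest_asset(asset: dict) -> dict:
    """Sub-step 6.3: read the raw copy, run initial optimizations
    (error validation, partitioning, ...) and load it into the DSU."""
    return {**asset, "status": "Ingested"}

raw_uri = extract_to_raw("Customers")
asset = register_asset(raw_uri)
asset = ingest_asset(asset)
```

The sketch highlights the hand-off between the sub-steps: registration produces the Asset record that ingestion then updates, so the metadata tables always track the state of each data drop.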
Data Intake Process types¶
There are several types of Data Intake Processes.
1. Data Intake Process from the landing zone¶
Some key features of a Data Intake Process from the landing zone are:
Sidra incorporates already-deployed, out-of-the-box pipelines for file ingestion from the landing zone. In this case, no explicit deployment or manual execution of the pipelines is necessary.
This type of Data Intake Process is usually selected for certain types of data sources in semi-structured format, for example:
- When the data can be deposited through some external data extraction process in e.g., .parquet or .csv format.
- When there is a separate data extraction pipeline developed that actually extracts semi-structured data files (e.g., JSON) from services or APIs.
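For either of these cases, the external extraction process and Sidra need to agree on where and how files are dropped. The sketch below shows one possible path convention for the landing zone; the folder layout is invented for illustration, since the actual agreement is defined per deployment.

```python
from datetime import date

# Hypothetical landing-zone naming convention: provider/entity/date folders.
# The real convention is part of the deposit agreement for each deployment.

def landing_path(provider: str, entity: str, run_date: date,
                 extension: str = "csv") -> str:
    return (f"landing/{provider.lower()}/{entity.lower()}/"
            f"{run_date:%Y/%m/%d}/{entity.lower()}.{extension}")

path = landing_path("CRM", "Customers", date(2024, 1, 1))
# Once a file appears under the agreed path, the out-of-the-box
# landing-zone pipelines register and ingest it with no manual steps.
```

A deterministic convention like this lets the landing-zone pipelines map each dropped file back to its Provider and Entity without any extra configuration per file.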
For a detailed explanation of a complete Data Intake Process using the landing zone, please continue in this page.
2. Data intake generic process for binary file ingestion¶
Sidra incorporates a separate process for binary file (document) ingestion that differs somewhat from the two processes described above.
Although some of the above concepts of file registration and file ingestion also apply to this type of data intake, there are significant differences in the indexing processing steps that need to happen for these files.
Among these binary files, Sidra also supports data ingestion from Excel files, a specialized sub-type of ingestion from the landing zone that requires additional scripts for metadata configuration and a specialized transfer query script.
Azure Search is the key service responsible for applying cognitive skills to the binary files (documents). The process usually starts by depositing the files in a special landing zone container called indexlanding.