How batch files are ingested into Databricks¶
To understand Sidra data intake, it is important to know the different steps involved in the ingestion of an Asset into the system, starting at the origin, or data source, and finishing in the Data Storage Unit (DSU).
The sections below describe the process at a high level. Sidra Data Platform supports different mechanisms to perform some of these steps.
Steps for generic file-based batch intake via landing zone¶
Step 1. File batch in landing zone and metadata configuration¶
This process involves a list of general steps, which can be executed by an external process or by a custom data extraction pipeline in Sidra:
- Configure the Data Source.
- Extract data from the data source and convert it into a file.
- Copy the file with the extracted data to the landing zone.
- Configure the metadata (Entities, Attributes) for the Data Intake.
- Configure new pipelines for data extraction.
- Deploy the data extraction and data ingestion pipeline.
On the setup and configuration side, if these processes run inside Sidra, the required pipelines need to be deployed in Sidra and executed inside the Data Storage Unit. On the execution side, these pipelines run according to their associated triggers, so Trigger objects need to be created and associated with the data extraction pipelines.
You can visit the Data Factory tables for more general information about Azure Data Factory metadata hosted in Sidra.
Summary
As a result of this execution:
- The data is deposited as files in a container in Sidra storage called the landing zone.
- The files need to be deposited in the landing zone following a set of conventions: a specific file naming convention and an agreed folder path that mirrors the structure of Providers/Entities (Asset metadata) in Sidra.
- Once the files are in the landing zone, a file ingestion process is triggered every time there is a new drop of data in the landing zone.
- The pipelines involved in file ingestion from the landing zone come preinstalled with any Sidra installation. The Data Source also does not need to be created explicitly.
- The only explicit step required to configure a new Data Intake from the landing zone is configuring the Asset metadata. This step can be executed by invoking the specific Sidra metadata API endpoints, and consists of creating the Provider, Entities and Attributes that specify the structure of the data to be ingested.
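As an illustration of this metadata configuration step, a minimal sketch using the Sidra metadata API from Python could look like the following. The endpoint paths, payload fields and identifiers shown here are assumptions for illustration only; the actual contracts are described in the Sidra API documentation.

```python
import requests

# Hypothetical base URL and token; the real values come from your Sidra installation.
SIDRA_API = "https://<your-sidra-service>.azurewebsites.net"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

# 1. Create (or reuse) a Provider. All field names here are illustrative assumptions.
provider = requests.post(
    f"{SIDRA_API}/api/metadata/providers",
    json={"providerName": "SalesDb", "ownerName": "DataTeam", "idDSU": 1},
    headers=HEADERS,
).json()

# 2. Create an Entity describing the files that will be dropped in the landing zone.
entity = requests.post(
    f"{SIDRA_API}/api/metadata/entities",
    json={
        "idProvider": provider["id"],
        "name": "Orders",
        "regularExpression": r"^Orders_\d{8}\.csv$",  # file naming convention
        "format": "csv",
    },
    headers=HEADERS,
).json()

# 3. Create the Attributes (one per column of the file).
requests.post(
    f"{SIDRA_API}/api/metadata/attributes",
    json=[
        {"idEntity": entity["id"], "name": "OrderId", "sqlType": "INT", "order": 1},
        {"idEntity": entity["id"], "name": "OrderDate", "sqlType": "DATE", "order": 2},
    ],
    headers=HEADERS,
)
```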
1. Configure and create a Data Source¶
Refer to the Linked Services page and its section for configuring a Data Source.
2. Extract data from the data source and convert it into a file¶
Sidra Data Platform supports multiple data sources. In general, the supported types can be categorized into these groups:
- Databases.
- The results of an API call.
- A set of files stored in an SFTP server.
In the case of complex data sources or specific restrictions in the data source (for example, custom logic needed to extract the contents of an API), the extraction can be done by any other custom or third-party component.
The step of extraction and conversion of data from the data source just involves the movement of data from this source system into the Sidra landing zone. There, the information extracted is stored in a file in one of the Sidra supported file types: `parquet`, `csv` or `json`.
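When the extraction is performed by a custom or third-party component instead of ADF, the conversion can be as simple as writing the extracted rows to one of these formats. A minimal sketch, assuming pandas (and pyarrow for Parquet) is available and using hypothetical sample data:

```python
import pandas as pd

# Hypothetical extraction result; in a real scenario this would come from the
# source database, API response or SFTP download.
rows = [
    {"OrderId": 1, "OrderDate": "2024-01-15", "Amount": 120.50},
    {"OrderId": 2, "OrderDate": "2024-01-16", "Amount": 89.90},
]
df = pd.DataFrame(rows)

# Write the data in one of the Sidra-supported file types.
df.to_parquet("Orders_20240116.parquet", index=False)  # requires pyarrow
# df.to_csv("Orders_20240116.csv", index=False)
# df.to_json("Orders_20240116.json", orient="records", lines=True)
```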
The extraction and the conversion to the supported file types are usually performed by Azure Data Factory (ADF) pipelines, as described in the Orchestration section. Data extraction pipelines are associated with an Entity through an EntityPipeline association in order to orchestrate the data extraction from the source to the landing zone in Sidra storage. Azure Data Factory includes many different connectors to a wide variety of data sources, and the most common, out-of-the-box data extraction scenario is achieved through Azure Data Factory.
The data extraction pipelines built through ADF in Sidra follow a naming convention: `ExtractFrom{data source name}` or `LoadFrom{data source name}`; for example, a pipeline extracting from a hypothetical data source called Sales would be named `ExtractFromSales`. Sidra provides templates to create extraction pipelines for the most common scenarios.
3. Copy the file with the extracted data to the landing zone¶
In the case of generic file batch intake, once the information is extracted from the data source, the file with the actual data is copied to the landing zone in Sidra Service.
The landing zone is an Azure Storage container specific for each Data Storage Unit (DSU). Depending on the landing zone in which the file is stored, the system will know in which Databricks cluster this file must be ingested, as there is a 1-1 relationship between the Databricks intake cluster and each DSU.
The main goal of copying the file to the landing zone is to trigger the data ingestion process.
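For a custom extraction component, copying the file to the landing zone is a plain blob upload. A minimal sketch using the Azure Storage SDK for Python; the container name, folder path and file name below are assumptions and must follow the conventions agreed for your Providers/Entities:

```python
from azure.storage.blob import BlobServiceClient

# Connection string of the DSU storage account (assumption: taken from configuration).
service = BlobServiceClient.from_connection_string("<dsu-storage-connection-string>")

# Hypothetical landing container and folder path mirroring the Provider/Entity structure.
blob_client = service.get_blob_client(
    container="landing",
    blob="SalesDb/Orders/2024/01/16/Orders_20240116.parquet",
)

with open("Orders_20240116.parquet", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)
```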
4. Configure the metadata (Entities, Attributes) for the Data Intake¶
The section about the Sidra metadata model hierarchy and the general metadata overview contain an explanation of the key pieces that make up the Sidra metadata model, namely DSU, Providers, Entities and Attributes.
Sidra adds additional system-level Attributes to convey system-handling information at Entity (or table) level.
These are the different steps required in order to set up the metadata for a new data source in Sidra. Links to tutorial pages are included for each of the steps for a generic data source:
- Create a new Provider (if required). A new data source will be composed of one or several Entities. Such Entities can be part of an existing Provider (because they are logically related to that Provider), or we can choose to associate these Entities with a new Provider. For example, a whole database could be considered a Provider or part of a Provider.
- Create Entity/Entities. A new data source is composed of at least one new Entity. For example, each table in an SQL database will be considered a different Entity.
- Create new Attributes. For example, to specify each of the fields of a database table or a CSV file.
This step is only required when configuring the Data Intake for the first time, or whenever there is any change in the file structures (e.g. added columns).
Another task performed in this step is to associate each Entity with the data ingestion pipeline responsible for ingesting the data.
Step-by-step
Add new Provider
This section explains how to add a new Provider.
Add new Entity
This section explains how to add a new Entity.
Add new Attribute
This section explains how to add a new Attribute.
Associate Entity to pipeline
This section explains how to associate Entities to pipelines.
Methods to configure the metadata for a Data Intake¶
Depending on the type of data source, Sidra incorporates helper methods and accelerators to perform the above steps and configure the actual file ingestion into the DSU.
In general, these are the ways to populate the metadata in the Sidra Service database:

- Through the Sidra API, creating the Provider, Entities and Attributes manually: see the details in the tutorial links above. This is the recommended option over the SQL scripts method if the data source does not yet include support for a Sidra connector (see next point).
- Through metadata extractor pipelines: for most databases (e.g., SQL database), Sidra incorporates accelerators such as metadata extractor pipelines, which create a pipeline associated with the Provider in order to automatically retrieve and infer all the metadata from the data source. Examples of the pipeline templates generating this type of pipeline are Insert metadata using database schema for TSQL and Insert metadata using database schema for DB2. This type of pipeline can be deployed and executed through API calls.
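As an illustration only, deploying and executing such a metadata extractor pipeline through the API could follow a pattern like the sketch below. The endpoint paths and payload fields are hypothetical placeholders rather than the actual Sidra API contract:

```python
import requests

SIDRA_API = "https://<your-sidra-service>.azurewebsites.net"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

# Hypothetical call: create a metadata extractor pipeline for a Provider from a
# pipeline template (e.g. "Insert metadata using database schema for TSQL")...
pipeline = requests.post(
    f"{SIDRA_API}/api/metadata/pipelines",
    json={"idPipelineTemplate": "<template-id>", "idProvider": 1},
    headers=HEADERS,
).json()

# ...and then deploy it so that it can be executed (again, a hypothetical endpoint).
requests.post(f"{SIDRA_API}/api/metadata/pipelines/{pipeline['id']}/deploy", headers=HEADERS)
```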
5. Define and create the data extraction pipeline to extract data from the source system¶
Once all the metadata regarding a set of Assets from a data source has been configured (Provider, Entities, Attributes), the next step is to populate the metadata database with the information of the Azure Data Factory pipelines.
For more information, check section ADF pipelines.
6. Deploy the data extraction and data ingestion pipeline¶
More information can be found in this section.
Step 2. Launch the ingestion process from the Landing zone¶
To summarize, these are general steps performed when launching the ingestion process:
- File registration.
- Storage of the raw copy of the file in the DSU.
- File ingestion.
Summary
As a result of this:
- Once the file is copied to the landing zone in Sidra, a trigger configured in Azure Data Factory (Blob storage file created) detects the new file and launches the ADF pipeline RegisterAsset to ingest the file into the system.
- The trigger executes the pipeline, substituting its parametrized variables (`folderPath` and `fileName`) with the information of the new file detected.
- This pipeline actually performs what is known as File registration process in Sidra. For file registration to succeed, it is important that the Entity and Attributes for the Data Intake (metadata) have been configured prior to the first data ingestion execution.
- The RegisterAsset pipeline also copies the file from the landing zone to an Azure Storage container (raw container) specific for the Data Storage Unit (DSU). This file copy is kept in the system as the raw copy of the file.
- The RegisterAsset pipeline will finally call another pipeline, which is the actual data ingestion pipeline.
- For the file ingestion from the landing zone, the data ingestion pipeline used is FileIngestionDatabricks.
- For the file registration to succeed, the pre-requirements described in the metadata configuration step need to be in place. This means that the created Entities will need to be associated to the pipeline FileIngestionDatabricks. The data ingestion pipeline FileIngestionDatabricks performs the following actions:
- Invoke Sidra API to ingest the file in the Data Storage Unit (DSU). More details are available in the File ingestion section below.
- After the data ingestion pipeline has finished, the calling pipeline RegisterAsset will delete the copy of the file from the landing zone.
1. File registration¶
The file registration is the process of creating an Asset in the platform representing the file to be ingested in the DSU. The result of the registration is the population of the correct data in the Sidra Service intake metadata and control tables.
The files are registered using the Sidra API and the process encompasses the following steps:
- Identify the Entity to which the file belongs. Every Entity in the metadata database in Sidra Service contains a `RegularExpression` column. This column encodes the pattern of filenames that will be followed by the Assets associated to the Entity. The name of the file is checked against the regular expression patterns to determine the Entity that the file is going to be associated with.
- Once the Entity is identified, it is verified that the Data Storage Unit (DSU) in which the file is going to be ingested and the Entity to which the file is associated are the correct ones. In order to perform this check, Sidra checks a relationship in the system (metadata tables) between the Entities and the DSUs.

Additionally, the Entity table needs to have some important metadata for the file ingestion to happen successfully. The Sidra Entity metadata table contains two date fields: `StartValidDate` and `EndValidDate`.

- `StartValidDate`: this is the start date from which the Entity is considered valid in the system. Assets will be ingested as long as the `StartValidDate` has passed. The Asset metadata table also contains a field called `AssetDate`. This date is not a timestamp (it does not contain hour, minute or second information), so 00:00:00 will be applied by default. In case of issues when importing the Asset, please review whether this may be the cause of the issue.
- `EndValidDate`: this is the end date until which the Entity is considered valid in the system. This may be used for marking certain Entities as "inactive" in the system with regards to new data ingestions.
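Conceptually, the checks performed during file registration can be pictured with a few lines of Python. This is a simplified illustration of the logic described above, not Sidra's actual implementation:

```python
import re
from datetime import datetime

# Simplified Entity metadata; the values are illustrative.
entity = {
    "name": "Orders",
    "regularExpression": r"^Orders_(?P<date>\d{8})\.csv$",
    "startValidDate": datetime(2024, 1, 1),
    "endValidDate": None,  # None: the Entity is still active
}

file_name = "Orders_20240116.csv"

# 1. Identify the Entity: the file name must match the Entity's RegularExpression.
match = re.match(entity["regularExpression"], file_name)
if not match:
    raise ValueError("No Entity matches this file name")

# 2. Check the validity window (simplified). The AssetDate carries no time part,
#    so 00:00:00 applies by default.
asset_date = datetime.strptime(match.group("date"), "%Y%m%d")
if asset_date < entity["startValidDate"]:
    raise ValueError("The Entity's StartValidDate has not passed yet")
if entity["endValidDate"] is not None and asset_date > entity["endValidDate"]:
    raise ValueError("The Entity is no longer valid for new ingestions")

print(f"{file_name} can be registered as an Asset of Entity '{entity['name']}'")
```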
2. Storage of the raw copy of the file in the DSU¶
This step just consists of storing the raw copy of the file registered in the previous step in an Azure Storage container in the DSU.
3. File ingestion¶
The file ingestion is the process that reads the raw copy of the file and, after executing some optimizations, stores the information in an optimized format in the DSU.
The file ingestion step is performed by an Azure Data Factory pipeline that will be selected depending on the configuration of the Entity associated to the Asset.
If the Entity has been configured to encrypt the file at ingestion, from version 1.8.2 onwards the pipeline used is still `FileIngestionDatabricks`. In previous versions, the pipeline used in this case was `FileIngestionWithEncryption`.
Changes in the system after the Asset ingestion¶
Once all these steps have been performed, the following objects will be created or updated in the platform:
- A new Asset will be added to the platform and the information about this Asset included in the Sidra Service metadata database.
- A raw copy of the Asset will be stored in an Azure Storage container in the DSU.
- An optimized copy of the data is stored in the Data Lake Storage in the DSU.