How an Asset flows into the platform via the Landing zone for file batch intake

An Asset in Sidra Data Platform represents a single instance of a data element that gets ingested into the platform. The term abstracts over many different data formats.

Examples of Assets in Sidra Data Platform are:

  • A database extract stored as an intermediate Parquet format file.
  • A PDF file which is part of a larger collection of documents.

While Entities are the metadata structures inside the Sidra Metadata, Assets are the specific instances of data ingested into the platform.

The key components in Sidra Data Platform have been designed to identify, support, manipulate, move and query Assets in the platform.

As such, it is important to understand the different steps that are encompassed in the ingestion of an Asset into the system, starting in the origin, or data source, and finishing in the Data Storage Unit (DSU).

The sections below describe the process at a general level; Sidra Data Platform supports different mechanisms to perform some of these steps.

It is important to note that certain types of data ingestion flows extract the data directly from the data source (e.g. a SQL database) to the raw format in the Data Storage Unit, without the intermediate step of depositing it in the landing zone. For other types of data flows, such as the generic batch file intake process described in this page, the landing zone is used as the starting point for file ingestion.

The steps below are the typical steps for the generic file-based batch intake in Sidra via the landing zone.

This includes a couple of general pre-steps, which can be executed by an external process, or by a custom data extraction pipeline in Sidra:

  • Pre-step 1: Extract data from the data source and convert it into a file
  • Pre-step 2: Copy the file with the extracted data to the landing zone.

On the setup and configuration side, if these processes run inside Sidra, the required pipelines need to be deployed in Sidra and executed inside the Data Storage Unit. On the execution side, these pipelines run according to their associated triggers, so Trigger objects need to be created and associated with the data extraction pipelines. You can visit the Data Factory tables for more general information about the Azure Data Factory metadata hosted in Sidra.

As a result of this execution, the data is deposited as files in a container in Sidra storage called the landing zone. The files need to be deposited in the landing zone following a set of conventions: a specific file naming convention, and an agreed folder path that mirrors the structure of Providers/Entities (Asset metadata) in Sidra.
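
As an illustration only, the sketch below composes a hypothetical landing zone path. The Provider and Entity names, the date token and the file extension are assumptions for the example; the real convention mirrors the Providers/Entities configured in your Sidra metadata.

```python
# Minimal sketch of a hypothetical landing zone layout. The Provider/Entity
# names, date token and extension are illustrative only; the actual convention
# must mirror the Providers/Entities configured in the Sidra metadata.
from datetime import date

provider = "SalesProvider"   # hypothetical Provider name
entity = "Customers"         # hypothetical Entity name
asset_date = date.today().strftime("%Y-%m-%d")

folder_path = f"{provider}/{entity}"
file_name = f"{entity}_{asset_date}.parquet"  # must match the Entity's RegularExpression

print(f"landing/{folder_path}/{file_name}")   # e.g. landing/SalesProvider/Customers/Customers_2024-01-01.parquet
```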

Once the files are in the landing zone, a file ingestion process is triggered every time there is a new drop of data in the landing zone.

The pipelines involved in file ingestion from the landing zone come preinstalled with any Sidra installation. The Data Source does not need to be created explicitly either.

The only explicit step required to configure a new data intake process from the landing zone is to configure the Asset metadata. This step can be executed by invoking the specific Sidra metadata API endpoints, and consists of creating the Provider, Entities and Attributes required to specify the structure of the data to be ingested.
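
As a rough, non-authoritative sketch, the snippet below invokes the metadata API with Python's requests library. The base URL, the endpoint paths (/api/metadata/providers, /api/metadata/entities, /api/metadata/attributes), the payload fields and the bearer token are assumptions for illustration; check the API documentation of your installation for the exact endpoints and contracts.

```python
# Hypothetical sketch of configuring Asset metadata through the Sidra metadata API.
# Endpoint paths and payload fields are assumptions; adjust them to your installation.
import requests

API = "https://core-example.sidra.dev"          # hypothetical Core API base URL
HEADERS = {"Authorization": "Bearer <token>"}   # token obtained from your identity provider

# 1. Create the Provider.
provider = requests.post(f"{API}/api/metadata/providers", headers=HEADERS,
                         json={"providerName": "SalesProvider", "description": "Sales data"}).json()

# 2. Create the Entity, including the RegularExpression used later for file registration.
entity = requests.post(f"{API}/api/metadata/entities", headers=HEADERS,
                       json={"name": "Customers",
                             "idProvider": provider["id"],
                             "regularExpression": r"^Customers_\d{4}-\d{2}-\d{2}\.parquet$"}).json()

# 3. Create the Attributes describing the structure of the data.
for name, data_type in [("CustomerId", "INT"), ("CustomerName", "STRING")]:
    requests.post(f"{API}/api/metadata/attributes", headers=HEADERS,
                  json={"idEntity": entity["id"], "name": name, "type": data_type})
```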

After the metadata structures have been generated in Sidra for this type of data intake process, the actual data ingestion starts, which includes two of the sub-steps detailed later in this page:

  • File (Asset) registration.
  • File ingestion.

For a detailed explanation of the file registration and file ingestion processes, you can check this other page.

Below are the lower-level details about the pre-steps and actual step of data ingestion:

Pre-step 1. Extract data from the data source and convert it into a file

Data from any data source is extracted and stored in the system as files.

Sidra Data Platform supports multiple data sources. In general, the supported types in Sidra can be categorized into these groups:

  • Databases.
  • The results of an API call.
  • A set of files stored in an SFTP server.

The information extracted from the data source is stored in a file in one of the file types supported by Sidra. Data can be deposited in the landing zone as Parquet, CSV or JSON files.

The extraction of the information from the data source and the conversion into one of the supported file types are usually performed by Azure Data Factory (ADF) pipelines, as described in the Orchestration section.

Azure Data Factory includes connectors to a wide variety of data sources, so the most common, out-of-the-box data extraction scenarios are achieved through Azure Data Factory.

In the case of complex data sources or specific restrictions in the data source (for example, custom logic to extract the contents of an API), the extraction can be done by any other custom or third-party component.
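
For example, a minimal custom extraction can be as simple as the following sketch, which reads a table with pandas and writes it as a Parquet file ready to be copied to the landing zone. The connection string, query and output file name are placeholders.

```python
# Minimal custom extraction sketch: read a source table and convert it to Parquet.
# The connection string, query and output file name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@source-server/SalesDb?driver=ODBC+Driver+17+for+SQL+Server"
)
df = pd.read_sql("SELECT CustomerId, CustomerName FROM dbo.Customers", engine)
df.to_parquet("Customers_2024-01-01.parquet", index=False)  # requires pyarrow or fastparquet
```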

This step just involves the movement of data from a source system (e.g. API, source database, etc.) into Sidra landing zone.

Data extraction pipelines are created and associated with the Provider in order to orchestrate the data extraction from the source.

The data extraction pipelines built through ADF pipelines in Sidra follow a naming convention: ExtractFrom{data source name} or LoadFrom{data source name}. These pipelines perform the actual extraction or movement of data from the source system to the landing zone in Sidra storage. Sidra provides templates to create extraction pipelines for the most common scenarios.

Pre-step 2. Copy the file with the extracted data to the landing zone

In the case of generic file batch intake, once the information is extracted from the data source, the file with the actual data is copied to the landing zone in Sidra Core.

The landing zone is an Azure Storage container specific to each Data Storage Unit (DSU). Depending on the landing zone in which the file is stored, the system will know in which Databricks cluster this file must be ingested, as there is a one-to-one relationship between the Databricks intake cluster and each DSU.

The main goal of copying the file to the landing zone is to trigger the data ingestion process.
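
A minimal upload sketch using the azure-storage-blob SDK is shown below. The connection string, the container name (landing) and the blob path are assumptions: the actual landing zone container belongs to the DSU storage account of your installation, and the blob path must follow the agreed Provider/Entity convention.

```python
# Hypothetical sketch: copy the extracted file to the DSU landing zone container.
# The connection string, container name and blob path are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<dsu-storage-connection-string>")
blob = service.get_blob_client(container="landing",
                               blob="SalesProvider/Customers/Customers_2024-01-01.parquet")

with open("Customers_2024-01-01.parquet", "rb") as data:
    blob.upload_blob(data, overwrite=True)  # the blob creation fires the ingestion trigger
```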

Pre-step 3. Configure the metadata (Entities, Attributes) for the data intake process

This step is only required when configuring the data intake process for the first time, or whenever there is any change in the file structures (e.g. added columns). See the tutorial pages Add new Provider, Add new Entity and Add new Attribute.

Another requirement covered in this step is to associate each Entity with the data ingestion pipeline responsible for ingesting the data. See the page Associate Entity to pipeline for more details.
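
Purely as an illustration of that association, the call below uses a hypothetical endpoint and payload; the Associate Entity to pipeline page describes the actual supported mechanisms.

```python
# Hypothetical call associating an Entity with the FileIngestionDatabricks pipeline.
# The endpoint path, Entity id and pipeline id are assumptions for illustration only.
import requests

requests.post("https://core-example.sidra.dev/api/metadata/entities/1234/pipelines",
              headers={"Authorization": "Bearer <token>"},
              json={"idPipeline": 42})  # id of the FileIngestionDatabricks pipeline
```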

Step 1. Launch the ingestion process from the Landing zone

To summarize, these are the general sub-steps performed as part of the file ingestion process from the landing zone in Sidra Core:

  • Sub-step 1.1: File registration.
  • Sub-step 1.2: Storage of the raw copy of the file in the DSU.
  • Sub-step 1.3: File ingestion.

All these sub-steps are described below:

Once the file is copied to the landing zone in Sidra, a Trigger configured in Azure Data Factory (a storage event trigger fired on blob creation) detects the new file and launches the ADF pipeline RegisterAsset to ingest the file into the system.

The trigger executes the pipeline, substituting its parameterized variables (folderPath and fileName) with the information of the newly detected file.
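
Conceptually, that substitution is equivalent to splitting the path of the created blob, as in this small illustrative sketch (the blob path is hypothetical):

```python
# Illustrative only: how the path of the created blob maps to the folderPath
# and fileName parameters passed to the RegisterAsset pipeline.
blob_path = "landing/SalesProvider/Customers/Customers_2024-01-01.parquet"  # hypothetical
folder_path, _, file_name = blob_path.rpartition("/")

print(folder_path)  # landing/SalesProvider/Customers
print(file_name)    # Customers_2024-01-01.parquet
```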

This pipeline performs what is known as the File registration process in Sidra.

For file registration to succeed, it is important that the Entity and Attributes for the data intake process (metadata) have been configured prior to the first data ingestion execution.

See the Tutorials section Add new Asset for more information about Asset creation in Sidra.

The RegisterAsset pipeline also copies the file from the landing zone to an Azure Storage container (raw container) specific for the Data Storage Unit (DSU). This file copy is kept in the system as the raw copy of the file.

The RegisterAsset pipeline finally calls another pipeline, which is the actual data ingestion pipeline. For file ingestion from the landing zone, the data ingestion pipeline used is FileIngestionDatabricks.

For the file registration to succeed, the pre-requirements described in Pre-step 3 need to be in place. This means that the created Entities need to be associated with the pipeline FileIngestionDatabricks.

The data ingestion pipeline FileIngestionDatabricks performs the following actions:

  1. Invokes the Sidra API to ingest the file in the Data Storage Unit (DSU). More details are available in the File ingestion section below.
  2. After the data ingestion pipeline has finished, the calling pipeline RegisterAsset deletes the copy of the file from the landing zone.

Sub-step 1.1 File registration

The file registration is the process of creating an Asset in the platform representing the file to be ingested in the DSU.
The result of the registration is the population of the correct data in the Sidra Core intake metadata and control tables.

The files are registered using the Sidra API and the process encompasses the following steps:

  1. Identify the Entity to which the file belongs. Every Entity in the metadata database in Sidra Core contains a RegularExpression column, which encodes the filename pattern that the Assets associated with the Entity must follow. The name of the file is checked against the regular expression patterns to determine the Entity that the file will be associated with (see the sketch after this list).
  2. Once the Entity is identified, it is verified that the Data Storage Unit (DSU) in which the file is going to be ingested and the Entity to which the file is associated are the correct ones. To perform this check, Sidra uses a relationship stored in the system (metadata tables) between the Entities and the DSUs.
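
The matching in step 1 is conceptually equivalent to the following sketch; the Entity names, patterns and file name are hypothetical examples.

```python
# Conceptual sketch of matching a file name against Entity RegularExpression values.
import re

entity_patterns = {  # hypothetical Entity name -> RegularExpression
    "Customers": r"^Customers_\d{4}-\d{2}-\d{2}\.parquet$",
    "Orders": r"^Orders_\d{4}-\d{2}-\d{2}\.parquet$",
}

file_name = "Customers_2024-01-01.parquet"
entity = next((name for name, pattern in entity_patterns.items()
               if re.match(pattern, file_name)), None)

print(entity)  # "Customers"; None would mean the file cannot be registered against any Entity
```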

Additionally, the Entity table needs to contain some important metadata for the file ingestion to happen successfully. The Sidra Entity metadata table contains two date fields: StartValidDate and EndValidDate.

  • StartValidDate: the start date from which the Entity is considered valid in the system. Assets will be ingested as long as the StartValidDate has passed. The Asset metadata table also contains a field called AssetDate. This date is not a timestamp (it does not contain hour, minute or second information), so by default 00:00:00 is applied. In case of issues when importing the Asset, please review whether this may be the cause.

  • EndValidDate: the end date until which the Entity is considered valid in the system. This can be used to mark certain Entities as "inactive" with regard to new data ingestions.
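
A minimal sketch of the resulting validity check is shown below, assuming the AssetDate is truncated to 00:00:00 as described above; whether EndValidDate is treated as inclusive or exclusive is an assumption of this sketch.

```python
# Illustrative validity check: an Asset is ingested only while its AssetDate
# falls inside the Entity's [StartValidDate, EndValidDate) window.
from datetime import datetime

start_valid = datetime(2024, 1, 1)   # Entity.StartValidDate
end_valid = datetime(2025, 1, 1)     # Entity.EndValidDate (None would mean still active)
asset_date = datetime(2024, 6, 15)   # AssetDate, time defaults to 00:00:00

is_valid = start_valid <= asset_date and (end_valid is None or asset_date < end_valid)
print(is_valid)  # True
```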

Sub-step 1.2 Storage of the raw copy of the file in the DSU

This step simply consists of storing the raw copy of the file registered in the previous sub-step in an Azure Storage container in the DSU.

Sub-step 1.3 File ingestion

The file ingestion is the process that reads the raw copy of the file and, after executing some optimizations, stores the information in an optimized format in the DSU.

The file ingestion sub-step is performed by an Azure Data Factory pipeline that will be selected depending on the configuration of the Entity associated to the Asset.

If the Entity has been configured to encrypt the file at ingestion, from version 1.8.2 onwards the pipeline to be used is still the FileIngestionDatabricks pipeline. In previous versions, the pipeline used in this case was FileIngestionWithEncryption.

This process is explained in more detail in the How Assets are ingested into Databricks section.

Changes in the system after the Asset ingestion

Once all these steps have been performed, the following objects will be created or updated in the platform:

  • A new Asset will be added to the platform and the information about this Asset included in the Sidra Core metadata database.
  • A raw copy of the Asset will be stored in an Azure Storage container in the DSU.
  • An optimized copy of the data is stored in the Data Lake Storage of the DSU.