How an Asset flows into the platform via the Landing zone for file batch intake

An Asset in Sidra Data Platform represents an instance of each data element that is ingested into the platform. Asset is a term that abstracts over many different data formats.

Examples of Assets in Sidra Data Platform are:

  • An intermediate database extract in Parquet format.
  • A PDF file which is part of a larger collection of documents.

While Entities are the metadata structures inside the Sidra Metadata, Assets are the specific instances of data ingested into the platform.

The key components in Sidra Data Platform have been designed to identify, support, manipulate, move and query Assets in the platform.

As such, it is important to understand the different steps that are encompassed in the ingestion of an Asset into the system, starting in the origin, or data source, and finishing in the Data Storage Unit (DSU).

The sections below detail the process in an abstracted way.

Sidra Data Platform supports different mechanisms to perform some of the below steps.

It is important to note that certain types of data ingestion flows extract the data directly from the data source (e.g. a SQL database) all the way to the raw format in the Data Storage Unit, without the intermediate step of depositing it in the landing zone. For certain other types of data flows, such as the generic batch file intake process, the landing zone file ingestion is used as the mechanism.

Therefore, for each type of data source and file there will be particularities in the pipelines or methods used to ingest the data.

The below steps are the typical steps for the generic file-based batch intake in Sidra via the landing zone. This includes the general steps for data extraction to the landing zone, and data ingestion from the landing zone to the Data Storage Unit.

Other types of data sources, e.g. SQL databases, abstract steps 1, 2 and 3 below into a single data extraction pipeline, which transparently performs the actual data movement out of the source system and the ingestion into Databricks.

Step 1. Extract data from the data source and convert it into a file

Data from any data source is extracted and stored in the system as files.

Sidra Data Platform supports multiple data sources, but most generally the supported types in Sidra can be categorized into these groups:

  • Databases.
  • The results of an API call.
  • A set of files stored in an SFTP server.

The information extracted from the data source is stored in a file in one of the Sidra supported file types. Currently, Parquet is the supported file format for storage in Sidra.
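
As an illustration only, the following Python sketch shows an extract being written to Parquet; the DataFrame contents and the file name are hypothetical and not a Sidra convention:

    import pandas as pd

    # Hypothetical extract already loaded in memory (e.g. from a database query or API call).
    rows = pd.DataFrame({"CustomerId": [1, 2], "Country": ["ES", "US"]})

    # Persist the extract as Parquet, the file format Sidra currently stores.
    rows.to_parquet("customers_20230101.parquet", index=False)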

The extraction of the information from the data source and the conversion to the supported file types are usually performed by Azure Data Factory (ADF) pipelines, as described in the Orchestration section.

Azure Data Factory includes many different connectors to a wide variety of data sources. The most common, out-of-the-box scenario of data extraction is achieved through Azure Data Factory.

In the case of complex data sources or specific restrictions in the data source, for example custom logic to extract the contents of an API, the extraction can be done by any other custom or third-party component.

This step just involves the movement of data from a source system (e.g. API, source database, etc.) into Sidra storage (landing zone).

As described in Configure a new data source section, depending on the data type, Sidra incorporates different templates of data extraction pipelines for the most common scenarios.

As mentioned above, in the case of some data sources, such as a SQL database, the extract data step does not deposit the data in the landing zone, but moves it directly from the source system to the raw zone in Sidra Core storage. In this case, steps 1, 2 and 3 are abstracted into a single pipeline.

Data extraction pipelines are created and associated with the Provider in order to orchestrate the data extraction from the source.

The data extraction pipelines built through ADF pipelines in Sidra follow a naming convention: ExtractFrom{data source name} or LoadFrom{data source name}. These involve the actual extraction or movement of data from the source system to the landing zone in Sidra storage.
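
For example, assuming a hypothetical data source called SalesDb, the resulting pipeline names would follow this pattern (only the ExtractFrom/LoadFrom convention comes from Sidra):

    # Hypothetical data source name used to illustrate the naming convention.
    data_source_name = "SalesDb"

    extract_pipeline = f"ExtractFrom{data_source_name}"  # ExtractFromSalesDb
    load_pipeline = f"LoadFrom{data_source_name}"        # LoadFromSalesDb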

Step 2. Copy the file with the extracted data to the landing zone

In the case of generic file batch intake, once the information is extracted from the data source and the file and Asset are created in Sidra, the file with the actual data is copied to the landing zone in Sidra Core.

Assets are automatically created in Sidra when the files with the data are copied to the Sidra landing zone. When a file is copied to the landing zone, a pipeline called RegisterAsset is automatically triggered, which adds the Asset to the Assets metadata table.
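
As a minimal sketch, and assuming the landing zone container of the target DSU is called "landing", copying a file to it with the Azure Storage SDK could look like this (the connection string, container and blob names are placeholders):

    from azure.storage.blob import BlobServiceClient

    # Placeholder connection string and container name for the DSU landing zone.
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    landing = service.get_container_client("landing")

    # Uploading the file is what triggers the RegisterAsset pipeline downstream.
    with open("customers_20230101.parquet", "rb") as data:
        landing.upload_blob(name="provider/entity/customers_20230101.parquet", data=data)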

See the Add new Asset tutorial in the Tutorials section to learn more about Assets in Sidra.

The Landing zone is an Azure Storage container specific for each Data Storage Unit (DSU). Depending on the landing zone in which the file is stored, the system will know in which Databricks cluster this file must be ingested, as there is a 1-1 relationship between the Databricks intake cluster and each DSU.

The main goal of copying the file to the landing zone is to trigger the ingestion process.

However, Sidra supports alternative ways of ingesting data: the Sidra Manager WebApp, which is deployed with every Sidra installation, also supports the ingestion process.

Step 3. Launch the ingestion process from the Landing zone

Once the file is copied to the landing zone in Sidra, a Trigger configured in Azure Data Factory (Blob storage file created) detects the new file and launches the ADF pipeline RegisterAsset to ingest the file into the system.

The trigger executes the pipeline, substituting its parametrized variables, folderPath and fileName, with the information of the newly detected file.
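
The sketch below illustrates, with a hypothetical blob path, how the detected file maps to those two parameters; the actual substitution is performed by the Azure Data Factory trigger, not by custom code:

    # Hypothetical path of the newly created blob in the landing zone.
    blob_path = "landing/provider/entity/customers_20230101.parquet"

    folder_path, _, file_name = blob_path.rpartition("/")

    pipeline_parameters = {"folderPath": folder_path, "fileName": file_name}
    # {'folderPath': 'landing/provider/entity', 'fileName': 'customers_20230101.parquet'}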

The ingestion pipeline performs the following actions:

  1. Invoke the Sidra API to register the file in the system. The file registration concept is described in the File registration section below.
  2. Copy the file from the landing zone to an Azure Storage container specific for the Data Storage Unit (DSU). This file copy is kept in the system as the "raw copy" of the file.
  3. Invoke Sidra API to ingest the file in the Data Storage Unit (DSU). More details are available in the File ingestion section below.
  4. Delete the copy of the file from the landing zone.

The ingestion process can be performed by the FileIngestionDatabricks ADF pipeline, or by any other mechanism, e.g. using the Sidra Manager as described above.
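
The following Python sketch summarizes those four actions; the API endpoints and the copy_blob/delete_blob helpers are hypothetical placeholders, since the real work is carried out by the pipeline activities and the Sidra API:

    import requests

    SIDRA_API = "https://<sidra-core>/api"  # placeholder base URL, not a real endpoint

    def copy_blob(source: str, destination: str) -> None:
        """Placeholder for the blob-to-blob copy into the DSU raw container."""

    def delete_blob(path: str) -> None:
        """Placeholder for removing the file from the landing zone."""

    def ingest_from_landing(folder_path: str, file_name: str) -> None:
        # 1. Register the file in the system via the Sidra API (hypothetical endpoint).
        asset = requests.post(f"{SIDRA_API}/assets",
                              json={"folderPath": folder_path, "fileName": file_name}).json()
        # 2. Keep a raw copy of the file in the DSU storage container.
        copy_blob(f"{folder_path}/{file_name}", f"raw/{file_name}")
        # 3. Ask the Sidra API to ingest the raw copy into the DSU (hypothetical endpoint).
        requests.post(f"{SIDRA_API}/assets/{asset['id']}/ingest")
        # 4. Delete the copy of the file from the landing zone.
        delete_blob(f"{folder_path}/{file_name}")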

To summarize, these are general sub-steps performed as part of the file ingestion process from landing zone in Sidra Core:

  • Sub-step 3.1: File registration.
  • Sub-step 3.2: Storage of the raw copy of the file in the DSU.
  • Sub-step 3.3: File ingestion.

Sub-step 3.1 File registration

The file registration is the process of creating an Asset in the platform representing the file to be ingested in the DSU.
The result of the registration is the population of the correct data in the Sidra Core intake metadata and control tables.

The files are registered using the Sidra API and the process encompasses the following steps:

  1. Identify the Entity to which the file belongs. Every Entity in the metadata database in Sidra Core contains a RegularExpression column. This column encodes the pattern of file names that the Assets associated with the Entity will follow. The name of the file is checked against the regular expression patterns to determine the Entity that the file will be associated with (see the sketch after this list).
  2. Once the Entity is identified, it is verified that the Data Storage Unit (DSU) in which the file is going to be ingested and the Entity to which the file is associated are the correct ones. In order to perform this check, Sidra checks a relationship in the system (metadata tables) between the Entities and the DSUs.
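
A minimal sketch of that matching logic is shown below; the Entity list, regular expressions and file name are illustrative only and not taken from a real metadata database:

    import re

    # Hypothetical Entities as they might be read from the metadata database.
    entities = [
        {"Id": 10, "Name": "Customers", "RegularExpression": r"^customers_\d{8}\.parquet$"},
        {"Id": 11, "Name": "Orders", "RegularExpression": r"^orders_\d{8}\.parquet$"},
    ]

    file_name = "customers_20230101.parquet"

    # The file is associated with the first Entity whose pattern matches its name.
    entity = next(e for e in entities if re.match(e["RegularExpression"], file_name))
    # entity["Name"] == "Customers"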

Additionally, the Entity table needs to contain some important metadata for the file ingestion to happen successfully. The Sidra Entity metadata table contains two date fields: StartValidDate and EndValidDate.

  • StartValidDate: this is the start date from which the Entity is considered valid in the system. Assets will be ingested as long as the StartValidDate has passed. The Asset metadata table also contains a field called AssetDate. This date is not a timestamp (it does not contain hour, minute or second information), so 00:00:00 is applied by default. In case of issues when importing the Asset, please review whether this may be the cause.

  • EndValidDate: this is the end date after which the Entity is no longer considered valid in the system. This may be used to mark certain Entities as "inactive" in the system with regards to new data ingestions (see the sketch after this list).
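
As a sketch of the validity check, and assuming the rule is simply that AssetDate must fall between StartValidDate and EndValidDate (with an empty EndValidDate meaning the Entity is still active), the logic could be expressed as follows; the exact rule applied by Sidra may differ:

    from datetime import date
    from typing import Optional

    def entity_is_valid_for(asset_date: date,
                            start_valid_date: date,
                            end_valid_date: Optional[date]) -> bool:
        # AssetDate carries no time component, so 00:00:00 is implicitly assumed.
        return (start_valid_date <= asset_date
                and (end_valid_date is None or asset_date <= end_valid_date))

    entity_is_valid_for(date(2023, 1, 1), date(2022, 12, 31), None)  # True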

Sub-step 3.2 Storage of the raw copy of the file in the DSU

This step consists of storing the raw copy of the file registered in the previous sub-step in an Azure Storage container in the DSU.

Sub-step 3.3 File ingestion

The file ingestion is the process that reads the raw copy of the file and intakes the information in an optimized format in the DSU, after executing some initial optimizations.

The file ingestion sub-step is performed by an Azure Data Factory pipeline that will be selected depending on the configuration of the Entity associated to the Asset.

If the Entity has been configured to encrypt the file at ingestion, from version 1.8.2 onwards the pipeline used is still the FileIngestionDatabricks pipeline. In previous versions, the pipeline used in this case was FileIngestionWithEncryption.
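
A small sketch of that selection logic follows; the version comparison and flag name are assumptions for illustration, not the actual Sidra implementation:

    def select_ingestion_pipeline(encrypt_at_ingestion: bool, sidra_version: tuple) -> str:
        # From 1.8.2 onwards, encryption no longer requires a dedicated pipeline.
        if encrypt_at_ingestion and sidra_version < (1, 8, 2):
            return "FileIngestionWithEncryption"
        return "FileIngestionDatabricks"

    select_ingestion_pipeline(True, (1, 7, 0))  # "FileIngestionWithEncryption"
    select_ingestion_pipeline(True, (1, 8, 2))  # "FileIngestionDatabricks"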

This process is explained in more detail in the How Assets are ingested into Databricks section.

Changes in the system after the Asset ingestion

Once all these steps have been performed, the following objects will be created or updated in the platform:

  • A new Asset will be added to the platform and the information about this Asset included in the Sidra Core metadata database.
  • A raw copy of the Asset will be stored in an Azure Storage container in the DSU.
  • An optimized copy of the information will be stored in Data Lake Storage in the DSU.