How an Asset flows into the platform

An Asset in Sidra Data Platform represents each of the data elements that are ingested into the platform. The term abstracts over many different data formats.

Examples of Assets in Sidra Data Platform are:

  • An intermediate database extract in CSV format.
  • A PDF file which is part of a larger collection of documents.

While Entities are the metadata structures inside the Sidra Metadata, Assets are the specific instances of data ingested into the platform.

The key components in Sidra Data Platform have been designed to identify, support, manipulate, move and query Assets in the platform.

As such, it is important to understand the different steps encompassed in the ingestion of an Asset into the system, starting at the origin, or data source, and finishing in the Data Storage Unit (DSU).

The sections below detail the process in an abstracted way.

Sidra Data Platform supports different mechanisms to perform some of the steps below. Therefore, for each type of data source and file there will be particularities in the pipelines or methods used to ingest the data.

Step 1. Extract data from the data source and convert it into a file

Any data from any data source is extracted and stored in the system as files.

Sidra Data Platform supports multiple data sources, which can generally be categorized into these groups:

  • Databases.
  • The results of an API call.
  • A set of files stored in an SFTP server.

The information extracted from the data source is stored in a file in one of the Sidra supported file types. Currently the supported file types are CSV and Parquet.

If the information stored in the data source is in any other file type, such as XML or JSON, it first has to be converted to one of the supported file types.
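
By way of illustration, the following standalone Python sketch performs this kind of conversion with pandas; the file names are placeholders, and this is not how Sidra itself implements the step:

```python
# Illustrative conversion from an unsupported file type (JSON) to a
# supported intake type (Parquet). File names are placeholders.
import pandas as pd

records = pd.read_json("source_extract.json")   # source data in JSON
records.to_parquet("source_extract.parquet")    # supported intake format
```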

The extraction of the information from the data source and the conversion to one of the supported file types are usually performed by Azure Data Factory (ADF) pipelines, as described in the Orchestration section.

Azure Data Factory includes connectors to a wide variety of data sources, so the most common, out-of-the-box data extraction scenarios are achieved through Data Factory.

For complex data sources, or when the data source imposes specific restrictions, for example when custom logic is needed to extract the contents of an API, the extraction can be performed by any other custom or third-party component.

The data extraction pipelines built in ADF in Sidra follow a naming convention: ExtractFrom{data source name} or LoadFrom{data source name}.

Sidra provides templates to create extraction pipelines for the most common scenarios.

Step 2. Copy the file with the extracted data to the landing zone

Once the information is extracted from the data source into a file, the file with the actual data is copied to the landing zone in Sidra Core.

The landing zone is an Azure Storage container specific to each Data Storage Unit (DSU). Depending on the landing zone in which the file is stored, the system knows in which Databricks cluster the file must be ingested, since there is a one-to-one relationship between each DSU and its Databricks intake cluster.

The main goal of copying the file to the landing zone is to trigger the ingestion process using the IngestFromLanding ADF pipeline (see Step 3. Launch the ingestion process).
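
For illustration, the copy can be reproduced manually with the azure-storage-blob Python SDK; the connection string, container name, and blob path below are placeholders for the DSU-specific landing zone:

```python
# Minimal sketch of copying an extracted file to a DSU landing zone.
# In Sidra this copy is normally performed by an ADF pipeline; landing
# the blob is what fires the trigger that launches IngestFromLanding.
from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str="<landing-storage-connection-string>",
    container_name="landing",                      # DSU-specific container
    blob_name="provider/entity/entity_20240101.csv",
)

with open("entity_20240101.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)         # file lands; trigger fires
```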

However, Sidra supports alternative ways of ingesting data: the Sidra Manager WebApp, which is deployed with every Sidra installation, supports the ingestion process without using the IngestFromLanding pipeline.

Step 3. Launch the ingestion process

Once the file is copied to the landing zone in Sidra, a Trigger configured in Azure Data Factory detects the new file and launches the ADF pipeline IngestFromLanding to ingest the file into the system.

The trigger executes the pipeline, substituting its parameterized variables (folderPath and fileName) with the information of the newly detected file.
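
The following sketch shows the equivalent manual invocation with the azure-mgmt-datafactory Python SDK, passing the same two parameters the trigger fills in from the blob event; all resource names are placeholders:

```python
# Hedged sketch of launching IngestFromLanding by hand. The trigger binds
# folderPath and fileName automatically; here we pass placeholder values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="IngestFromLanding",
    parameters={
        "folderPath": "landing/provider/entity",   # from the blob event
        "fileName": "entity_20240101.csv",         # from the blob event
    },
)
print(run.run_id)
```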

The ingestion pipeline performs the following actions (sketched in code after the list):

  1. Invoke the Sidra API to register the file in the system. The file registration concept is described in the File registration section below.
  2. Copy the file from the landing zone to an Azure Storage container specific for the Data Storage Unit (DSU). This file copy is kept in the system as the "raw copy" of the file.
  3. Invoke Sidra API to ingest the file in the Data Storage Unit (DSU). More details are available in the File ingestion section below.
  4. Delete the copy of the file from the landing zone.
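
To make the sequence concrete, here is a hypothetical Python rendering of those four actions. The API routes and helper functions are invented for illustration and are not the documented Sidra API surface; in production these steps run as activities inside the ADF pipeline.

```python
# Hypothetical end-to-end rendering of the four ingestion actions.
# API routes and helpers below are illustrative placeholders.
import requests

API = "https://<sidra-core-host>/api"            # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}

def copy_landing_to_raw(asset: dict) -> None:    # placeholder helper
    ...

def delete_from_landing(asset: dict) -> None:    # placeholder helper
    ...

# 1. Register the file, creating the Asset in the metadata database.
asset = requests.post(f"{API}/assets/register",
                      json={"fileName": "entity_20240101.csv"},
                      headers=HEADERS).json()

# 2. Keep the "raw copy" of the file in the DSU's Azure Storage container.
copy_landing_to_raw(asset)

# 3. Ask the API to ingest the registered Asset into the DSU.
requests.post(f"{API}/assets/{asset['id']}/ingest", headers=HEADERS)

# 4. Remove the original file from the landing zone.
delete_from_landing(asset)
```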

The ingestion process can be performed by the IngestFromLanding ADF pipeline or by any other mechanism, e.g. the Sidra Manager WebApp described above.

To summarize, these are general sub-steps performed as part of the ingestion process in Sidra Core:

  • Sub-step 3.1: File registration
  • Sub-step 3.2: Storage of the raw copy of the file in the DSU
  • Sub-step 3.3: File ingestion

Sub-step 3.1 File registration

The file registration is the process of creating an Asset in the platform representing the file to be ingested in the DSU.
The result of the registration is the population of the correct data in the Sidra Core intake metadata and control tables.

The files are registered using the Sidra API and the process encompasses the following steps:

  1. Identify the Entity to which the file belongs. Every Entity in the metadata database in Sidra Core contains a RegularExpression column, which encodes the filename pattern that the Assets associated with the Entity must follow. The name of the file is checked against these regular expression patterns to determine the Entity with which the file is going to be associated (see the sketch after this list).
  2. Once the Entity is identified, Sidra verifies that the Data Storage Unit (DSU) in which the file is going to be ingested and the Entity with which the file is associated are the correct ones. To perform this check, Sidra consults a relationship between Entities and DSUs stored in the system's metadata tables.
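
The matching in step 1 can be pictured with a few lines of Python; the Entity records and patterns below are invented for illustration, since the real ones live in the Core metadata database:

```python
# Illustrative only: match a filename against each Entity's
# RegularExpression column to find the owning Entity.
import re

entities = [
    {"Id": 101, "Name": "employees", "RegularExpression": r"^employees_\d{8}\.csv$"},
    {"Id": 102, "Name": "invoices",  "RegularExpression": r"^invoices_\d{8}\.parquet$"},
]

def resolve_entity(file_name: str):
    """Return the first Entity whose filename pattern matches, or None."""
    for entity in entities:
        if re.match(entity["RegularExpression"], file_name):
            return entity
    return None

print(resolve_entity("employees_20240101.csv"))  # -> Entity 101
```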

Sub-step 3.2 Storage of the raw copy of the file in the DSU

This step simply consists of storing the raw copy of the file registered in the previous sub-step in an Azure Storage container in the DSU.

Sub-step 3.3 File ingestion

The file ingestion is the process that reads the raw copy of the file and loads the information, in an optimized format, into the DSU, after executing some initial optimizations.
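
A minimal PySpark sketch of this kind of operation on the Databricks intake cluster might look as follows; the paths and the partition column are assumptions, not Sidra's actual ingestion code:

```python
# Minimal sketch: read the raw CSV copy and persist it in an optimized,
# partitioned columnar format. Paths and columns are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("file-ingestion-sketch").getOrCreate()

raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("abfss://raw@<dsu-storage>.dfs.core.windows.net/provider/entity/file.csv")
)

# Initial optimizations: stamp the ingestion date and write a partitioned,
# columnar copy that downstream queries can read efficiently.
(
    raw_df.withColumn("IngestionDate", F.current_date())
    .write.mode("append")
    .partitionBy("IngestionDate")
    .parquet("abfss://lake@<dsu-storage>.dfs.core.windows.net/provider/entity/")
)
```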

The file ingestion sub-step is performed by an Azure Data Factory pipeline that is selected depending on the configuration of the Entity associated with the Asset.

For example, if the Entity has been configured to encrypt the file at ingestion, then the pipeline used to ingest will be the FileIngestionWithEncryption pipeline.
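
The selection logic can be pictured like this; the configuration flag and the default pipeline name are assumptions for illustration, while FileIngestionWithEncryption is the pipeline named above:

```python
# Hypothetical sketch of pipeline selection driven by Entity configuration.
# The flag name and the default pipeline name are illustrative assumptions.
def select_ingestion_pipeline(entity_config: dict) -> str:
    if entity_config.get("encryptAtIngestion", False):
        return "FileIngestionWithEncryption"
    return "FileIngestion"  # assumed name of the default ingestion pipeline

print(select_ingestion_pipeline({"encryptAtIngestion": True}))
# -> FileIngestionWithEncryption
```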

This process is explained in more detail in the How assets are ingested into Databricks section.

Changes in the system after the Asset ingestion

Once all these steps have been performed, the following objects will be created or updated in the platform:

  • A new Asset will be added to the platform, and the information about this Asset will be included in the Sidra Core metadata database.
  • A raw copy of the Asset will be stored in an Azure Storage container in the DSU.
  • An optimized copy of the information will be included in the Azure Data Lake Storage in the DSU.