How an asset flows into the platform

Assets are the central element in the Sidra Data Platform; the rest of the elements exist to support, manipulate, move and query assets within the platform. It is therefore important to understand the steps that an asset follows when it is ingested into the system, starting at the data source and finishing in the Data Lake.

1. Extract data from the data source and convert it into a file

Data from any source is extracted and stored as files in the system. The source can be a database, an API, a set of files stored on an SFTP server, or any other data source. The extracted information is stored in a file in one of the file types supported by Sidra (currently CSV and Parquet). If the information in the data source comes in any other format, such as XML or JSON, it has to be converted to one of the supported file types.

The extraction of the information from the data source and its conversion to one of the supported file types are usually performed by Azure Data Factory (ADF) pipelines. For complex data sources, or when the data source imposes specific restrictions, the extraction can be carried out by any other third-party component; ADF pipelines are not mandatory, just the most common scenario. These ADF pipelines are specific to each data source and are usually named ExtractFrom{data source name} or LoadFrom{data source name}. Sidra provides templates to create extraction pipelines for the most common scenarios.
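
As an illustration of this conversion step, the following minimal Python sketch turns extracted JSON-like records into a Parquet file. The sample data and file names are assumptions for the example; in a real deployment this work is normally done by the ExtractFrom{data source name} pipeline.

```python
import pandas as pd

# Hypothetical example: records extracted from a source (e.g. an API returning
# JSON) are converted to Parquet, one of the file types supported by Sidra.
records = [
    {"CustomerId": 1, "Name": "Contoso", "Country": "ES"},
    {"CustomerId": 2, "Name": "Fabrikam", "Country": "US"},
]

df = pd.DataFrame(records)

# Write the data as a Parquet file, ready to be copied to the landing zone.
# (CSV would also be valid: df.to_csv("customers_2023-01-01.csv", index=False))
df.to_parquet("customers_2023-01-01.parquet", index=False)
```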

2. Copy the file to the landing zone

Once the information is extracted and the file created, the file is copied to the landing zone. The landing zone is an Azure Storage container specific to each Data Storage Unit (DSU). The landing zone into which the file is copied tells the system which Databricks workspace the file must be ingested into, since there is a Databricks workspace associated with each DSU.

The goal of copying the file to the landing zone is to trigger the ingestion process via the IngestFromLanding ADF pipeline (see step 3, Launching the ingestion process). This is not the only way, however: for example, when using Sidra Manager (a WebApp deployed with Sidra to manage the platform), the ingestion process is performed without the IngestFromLanding pipeline.
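
The copy itself is a plain blob upload. Below is a minimal Python sketch using the Azure Storage Blob SDK; the connection string, container name and blob path are placeholders, since the actual landing zone container depends on the DSU deployment.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder values: the real landing zone container and connection string
# depend on the Data Storage Unit (DSU) deployment.
CONNECTION_STRING = "<storage-account-connection-string>"
LANDING_CONTAINER = "landing"

blob_service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container = blob_service.get_container_client(LANDING_CONTAINER)

# Copy the extracted file into the landing zone; the blob path (folder and
# file name) is the information the ADF trigger later passes to IngestFromLanding.
with open("customers_2023-01-01.parquet", "rb") as data:
    container.upload_blob(
        name="customers/customers_2023-01-01.parquet",
        data=data,
    )
```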

3. Launching the ingestion process

Once the file is copied to the landing zone, a trigger configured in Azure Data Factory detects the new file and launches the IngestFromLanding ADF pipeline to ingest the file into the system. The trigger executes the pipeline, populating its parameters (folderPath and fileName) with the information of the newly detected file. The pipeline performs the following actions (sketched in code after the list):

  1. Uses the Sidra API to register the file in the system. More details are available in the File registration section below.
  2. Copies the file from the landing zone to an Azure Storage container specific to the Data Storage Unit (DSU). This copy is kept in the system as the "raw copy" of the file.
  3. Uses the Sidra API to ingest the file into the Data Lake. More details are available in the File ingestion section below.
  4. Deletes the copy of the file from the landing zone.
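
The following Python sketch outlines those four actions in order. The Sidra API routes, payloads and container names used here are assumptions for illustration only, not the documented API contract; in practice this orchestration is done by the IngestFromLanding ADF pipeline.

```python
import requests
from azure.storage.blob import BlobServiceClient

# Illustrative sketch of the four IngestFromLanding actions.
SIDRA_API = "https://<sidra-core>/api"   # placeholder base URL
LANDING, RAW = "landing", "raw"          # placeholder container names

def ingest_from_landing(blob_service: BlobServiceClient,
                        folder_path: str, file_name: str, token: str) -> None:
    headers = {"Authorization": f"Bearer {token}"}
    blob_path = f"{folder_path}/{file_name}"

    # 1. Register the file: creates the asset in the metadata database.
    asset = requests.post(f"{SIDRA_API}/assets/register",        # placeholder route
                          json={"fileName": file_name},
                          headers=headers).json()

    # 2. Keep a "raw copy" of the file in the DSU storage container.
    source = blob_service.get_blob_client(LANDING, blob_path)
    raw_copy = blob_service.get_blob_client(RAW, blob_path)
    raw_copy.start_copy_from_url(source.url)

    # 3. Ingest the raw copy into the Data Lake via the Sidra API.
    requests.post(f"{SIDRA_API}/assets/{asset['id']}/ingest",     # placeholder route
                  headers=headers)

    # 4. Delete the copy of the file from the landing zone.
    source.delete_blob()
```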

The ingestion process can be performed by IngestFromLanding or by any other mechanism (e.g. Sidra Manager); in any case, it must include:

  • File registration
  • Storage of the raw copy of the file in the DSU
  • File ingestion

File registration

File registration is the process of creating an asset in the platform that represents the file to be ingested into the Data Lake. Files are registered using the Sidra API, and the process encompasses the following steps:

  1. Identify the entity to which the file belongs. Every entity in the metadata database stores, in its RegularExpression column, the filename pattern followed by the assets associated with that entity. The name of the file is checked against all those patterns to find the one that matches and thereby determine the entity the file must be associated with (see the sketch after this list).
  2. Once the entity is known, the system verifies that the Data Storage Unit (DSU) in which the file is going to be ingested is the correct one for that entity. This check is possible because the system maintains a relationship between entities and DSUs.
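
A minimal Python sketch of the pattern-matching part of step 1, using a hypothetical list of entities and RegularExpression values as example data:

```python
import re

# Hypothetical example entities; in Sidra these come from the metadata database,
# each with a RegularExpression describing the filenames of its assets.
entities = [
    {"Id": 10, "Name": "Customers",
     "RegularExpression": r"^customers_\d{4}-\d{2}-\d{2}\.parquet$"},
    {"Id": 11, "Name": "Orders",
     "RegularExpression": r"^orders_\d{4}-\d{2}-\d{2}\.csv$"},
]

def find_entity(file_name: str) -> dict | None:
    """Return the entity whose filename pattern matches the incoming file, if any."""
    for entity in entities:
        if re.match(entity["RegularExpression"], file_name):
            return entity
    return None

print(find_entity("customers_2023-01-01.parquet"))  # -> the Customers entity
```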

File ingestion

File ingestion is the process that reads the raw copy of the file and loads its information into the Data Lake. It is performed by an Azure Data Factory pipeline selected according to the configuration of the entity associated with the file; for example, if the entity is configured to encrypt the file, the FileIngestionWithEncryption pipeline is used.
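
A hedged sketch of what this selection could look like. Only the FileIngestionWithEncryption name is taken from the example above; the configuration flag and the default pipeline name are assumptions for illustration.

```python
def select_ingestion_pipeline(entity: dict) -> str:
    """Pick an ingestion pipeline name based on the entity configuration (sketch)."""
    if entity.get("NeedsEncryption"):         # hypothetical configuration flag
        return "FileIngestionWithEncryption"  # pipeline named in the example above
    return "FileIngestion"                    # hypothetical default pipeline name

print(select_ingestion_pipeline({"Name": "Customers", "NeedsEncryption": True}))
```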

This process is explained in more detail in the How assets are ingested into Databricks section.

Changes in the system after the asset ingestion

After the last step, the platform has been updated in these ways:

  • A new asset is added to the platform, and its information is included in the metadata database.
  • A raw copy of the asset is stored in an Azure Storage container in the Data Lake.
  • An optimized copy of the information is included in the Data Lake.