How Assets are ingested into Databricks

This section describes in more detail the step introduced in File Ingestion.

Asset ingestion into Databricks is an automated process that starts from the raw copy of the Asset in the DSU's Blob Storage and ends with an optimized copy of the information in the DSU's Databricks storage (Azure Data Lake Storage).

In order to automate the process, the ingestion is performed by an Azure Data Factory (ADF) pipeline.

The specific pipeline to deploy and run is selected based on the configuration of the Entity with which the Asset is associated.

File ingestion pipelines also generate a set of scripts. These scripts are created from the parametrization of the Entity metadata details and are executed each time data is actually transferred.

Ingestion pipeline selection

Knowing the Entity with which the Asset is associated, the selection of the ADF pipeline follows these automated steps:

  1. Retrieve from the Sidra Core metadata tables the relationship between pipelines and Entities. This relationship is configured in the EntityPipeline table of the metadata database and lists all the pipelines associated with the Entity.
  2. From that list, choose the first ADF pipeline of type LoadRegisteredAsset. The metadata database stores both ADF and Azure Search pipelines, but only Azure Data Factory pipelines of type LoadRegisteredAsset are considered in this selection.
  3. If step 2 does not produce any selection, choose the pipeline with the same name as the Entity.
  4. If step 3 does not produce any selection, choose the default pipeline configured in the platform for file ingestion.
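As an illustration, the selection order above could be expressed against the metadata database roughly as follows. This is a minimal sketch: the EntityPipeline and Management.Configuration tables come from the description above, but the remaining table and column names (Pipeline, Id, IdPipeline, IdEntity, Type, Name, Key, Value) and the @IdEntity / @EntityName parameters are assumptions for the example.

```sql
-- Steps 1-2: pipelines associated with the Entity, keeping only ADF pipelines
-- of type LoadRegisteredAsset (which pipeline counts as "first" is not specified here).
SELECT TOP 1 p.*
FROM EntityPipeline ep
JOIN Pipeline p ON p.Id = ep.IdPipeline
WHERE ep.IdEntity = @IdEntity
  AND p.Type = 'LoadRegisteredAsset';

-- Step 3: fallback, the pipeline whose name matches the Entity name.
SELECT TOP 1 p.*
FROM Pipeline p
WHERE p.Name = @EntityName;

-- Step 4: fallback, the default file ingestion pipeline from the platform configuration.
SELECT TOP 1 p.*
FROM Pipeline p
JOIN Management.Configuration c ON c.[Value] = p.Name
WHERE c.[Key] = 'GenericPipelineName';
```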

The default pipeline for file ingestion can be configured in the Management.Configuration table using GenericPipelineName as the key and the name of the pipeline as the value, e.g. FileIngestionDatabricks.
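For example, assuming Management.Configuration uses simple key/value columns (the column names here are illustrative), the default pipeline could be registered like this:

```sql
-- Illustrative only: set FileIngestionDatabricks as the default file ingestion pipeline.
INSERT INTO Management.Configuration ([Key], [Value])
VALUES ('GenericPipelineName', 'FileIngestionDatabricks');
```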

More information about the platform configuration tables can be found in the Management metadata section.

File ingestion pipelines

Every file ingestion pipeline can have its own particularities, but all of them work in a similar way.

These pipelines each generate and execute two Spark SQL scripts.

When these pipelines run, the scripts are auto-generated specifically for the Entity of the Asset being ingested, so Assets from the same Entity share the same scripts.

There are two different scripts that are generated sequentially:

  • The Table creation script creates the necessary database and tables in the Databricks cluster. More detailed information about this script can be found in the Table creation section.
  • The Transfer query script reads the raw copy of the Asset and inserts the information into the tables created by the previous script. One transfer query script is generated per Entity. More information about this script can be found in the Transfer query section.

Both scripts are generated based on the Entity's metadata and can be stored in the Databricks File System (DBFS), so that they can be executed by the Databricks cluster, or in an Azure Storage account.
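As a rough illustration of what the generated scripts look like, the sketch below shows a table creation script followed by a transfer query. All database, table, column and view names are placeholders, and the storage format shown is an assumption; the real scripts are derived from the Entity and Attribute metadata.

```sql
-- Table creation script (generated per Entity): create the database and table
-- that will hold the optimized copy of the data.
CREATE DATABASE IF NOT EXISTS provider_database;

CREATE TABLE IF NOT EXISTS provider_database.entity_table (
    attribute_1 STRING,
    attribute_2 INT
    -- one column per Attribute defined in the Entity metadata
)
USING DELTA;  -- the storage format is an assumption for this sketch

-- Transfer query script (executed for every Asset of the Entity): read the raw copy
-- of the Asset and insert it into the table created above.
INSERT INTO provider_database.entity_table
SELECT attribute_1,
       CAST(attribute_2 AS INT)
FROM raw_asset_view;  -- temporary view over the raw copy of the Asset
```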

The scripts are generated only if the following conditions are satisfied:

  • The Entity's metadata has been updated since the last time the scripts were created.
  • The Entity's metadata flag ReCreateTableOnDeployment is active.
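As an illustration, assuming both conditions must hold and that the Entity metadata keeps a last-update timestamp, the check could look like the sketch below; apart from the ReCreateTableOnDeployment flag, the table and column names and the @LastScriptGenerationDate parameter are hypothetical.

```sql
-- Hypothetical check: Entities whose scripts need to be (re)generated.
SELECT e.Id
FROM Entity e
WHERE e.ReCreateTableOnDeployment = 1
  AND e.LastUpdated > @LastScriptGenerationDate;
```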