How assets are ingested into Databricks

Asset ingestion into Databricks is an automated process that starts from the raw copy of the asset and ends with an optimized copy of the information in Databricks. To automate the process, the ingestion is performed by an Azure Data Factory (ADF) pipeline, which is selected based on the configuration of the entity to which the asset is associated.

Ingestion pipeline selection

Given the entity to which the asset is associated, the ADF pipeline is selected through the following steps (a query sketch is shown after the list):

  1. Use the relation between pipelines and entities established by the EntityPipeline table in the metadata database to retrieve all the pipelines related to the entity.
  2. From the previous list, select the first Azure Data Factory pipeline of type LoadRegisteredAsset. The metadata database stores both ADF and Azure Search pipelines, but only Azure Data Factory pipelines of type LoadRegisteredAsset are considered in this selection.
  3. If step 2 does not produce any selection, select the pipeline with the same name as the entity.
  4. If step 3 does not produce any selection, use the default pipeline configured in the platform for file ingestion.

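Although the selection logic runs in the platform rather than in the database, steps 1 and 2 can be pictured as a single query against the metadata database. The following is a minimal sketch: only the EntityPipeline table name comes from this documentation, while the Pipeline table, its columns, and the ordering criterion are assumptions.

    -- Sketch only: the Pipeline table and all column names are assumptions.
    -- Step 1: retrieve every pipeline related to the entity (via EntityPipeline).
    -- Step 2: keep only Azure Data Factory pipelines of type LoadRegisteredAsset
    --         and take the first one.
    SELECT TOP (1) p.Name
    FROM EntityPipeline ep
    JOIN Pipeline p ON p.PipelineId = ep.PipelineId
    WHERE ep.EntityId = @EntityId              -- entity associated with the asset
      AND p.Engine = 'AzureDataFactory'        -- Azure Search pipelines are excluded
      AND p.Type = 'LoadRegisteredAsset'
    ORDER BY p.PipelineId;                     -- "first" ordering is an assumption
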
The default pipeline for file ingestion can be configured in the Management.Configuration table using GenericPipelineName as the key and the name of the pipeline as the value, e.g. FileIngestionDatabricks. More information about the platform configuration tables can be found in the Management metadata section.
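
As a minimal sketch, the default pipeline could be registered with a statement like the one below; the [Key] and [Value] column names are assumptions about the Management.Configuration schema.

    -- Sketch only: the Management.Configuration column names are assumptions.
    INSERT INTO Management.Configuration ([Key], [Value])
    VALUES ('GenericPipelineName', 'FileIngestionDatabricks');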

File ingestion pipelines

Every file ingestion pipeline can have its own particularities, but all of them work in a similar way: they ingest the asset by generating and executing two Spark SQL query scripts. The scripts are created specifically for the entity of the asset to be ingested, which means that assets from the same entity share the same scripts.

The scripts are generated based on the entity's metadata and can be stored either in the Databricks File System (DBFS), so they can be executed by the Databricks cluster, or in an Azure Storage account. The scripts are generated only if the following conditions are satisfied:

  • The entity's metadata has been updated since the last time that the scripts were created.
  • The entity's metadata flag ReCreateTableOnDeployment is active.

The scripts generated are:

  • Table creation. It creates the necessary database and tables in the Databricks cluster. More information about it can be found in the Table creation section.
  • Transfer query. It reads the raw copy of the asset and inserts the information into the tables created by the previous script. More information about it can be found in the Transfer query section. A sketch of both scripts is shown below.
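
As a minimal illustration, assume a hypothetical entity named Sales whose raw copy lands as CSV files under an assumed mount path; the generated scripts could then resemble the Spark SQL below. Real scripts are derived from the entity's metadata, so names, columns, formats and paths will differ.

    -- Table creation (sketch): database name, table name, columns and the Delta
    -- format are assumptions for the hypothetical Sales entity.
    CREATE DATABASE IF NOT EXISTS sales_db;

    CREATE TABLE IF NOT EXISTS sales_db.sales (
      SaleId     INT,
      CustomerId INT,
      Amount     DECIMAL(18, 2),
      SaleDate   DATE
    )
    USING DELTA;  -- optimized copy of the information

    -- Transfer query (sketch): reads the raw copy of the asset and inserts it into
    -- the table created above. The mount path and header option are assumptions.
    CREATE TEMPORARY VIEW sales_raw
    USING CSV
    OPTIONS (path = '/mnt/raw/sales/', header = 'true');

    INSERT INTO sales_db.sales
    SELECT CAST(SaleId     AS INT),
           CAST(CustomerId AS INT),
           CAST(Amount     AS DECIMAL(18, 2)),
           CAST(SaleDate   AS DATE)
    FROM sales_raw;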