How to ingest a new asset from scratch¶
This section walks through the steps required to ingest a new type of content into the Sidra platform. It is important to understand which metadata about the new content must be generated and how that metadata is structured. The Metadata hierarchical model section covers that topic in more detail.
In brief, two groups of tasks must be completed:
- Metadata configuration
- Data movement orchestration
Metadata configuration¶
The metadata is stored in the metadata database (its name in Azure is Core, so it is also referred to as the Core database). Each element of the metadata model is stored in a table of that database, so configuring the metadata essentially means inserting rows into those tables.
The configuration depends on the content to be ingested and on the metadata that already exists in the database. For example, it may be necessary to add a new provider, or one of the existing providers may be reused.
The most common path to configure the metadata is:
- Add a new provider if none of the existing ones can be used.
- Add a new entity.
- Add a new attribute for each of the columns of the new type of content.
- Add a new attribute format if some of the columns need a transformation in order to be ingested into the Data Lake.
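The configuration path above can be sketched as a sequence of inserts. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the Core database, and the table names, column names, and helper functions (`get_or_add_provider`, `add_entity`, `add_attribute_format`) are hypothetical, as are the sample provider and entity. The real Core schema may differ and should be checked before writing any insert.

```python
import sqlite3

# Hypothetical, simplified schema. The real Core tables and columns may differ.
SCHEMA = """
CREATE TABLE Provider (Id INTEGER PRIMARY KEY, Name TEXT UNIQUE NOT NULL);
CREATE TABLE Entity (
    Id INTEGER PRIMARY KEY,
    ProviderId INTEGER NOT NULL REFERENCES Provider(Id),
    Name TEXT NOT NULL);
CREATE TABLE Attribute (
    Id INTEGER PRIMARY KEY,
    EntityId INTEGER NOT NULL REFERENCES Entity(Id),
    Name TEXT NOT NULL,
    Position INTEGER NOT NULL);
CREATE TABLE AttributeFormat (
    Id INTEGER PRIMARY KEY,
    AttributeId INTEGER NOT NULL REFERENCES Attribute(Id),
    SourceValue TEXT NOT NULL,
    TargetValue TEXT NOT NULL);
"""


def get_or_add_provider(conn, name):
    """Reuse an existing provider if one matches, otherwise insert a new one."""
    row = conn.execute("SELECT Id FROM Provider WHERE Name = ?", (name,)).fetchone()
    if row:
        return row[0]
    return conn.execute("INSERT INTO Provider (Name) VALUES (?)", (name,)).lastrowid


def add_entity(conn, provider_id, name, columns):
    """Insert the entity and one attribute per column of the new content."""
    entity_id = conn.execute(
        "INSERT INTO Entity (ProviderId, Name) VALUES (?, ?)", (provider_id, name)
    ).lastrowid
    attribute_ids = {}
    for position, column in enumerate(columns, start=1):
        attribute_ids[column] = conn.execute(
            "INSERT INTO Attribute (EntityId, Name, Position) VALUES (?, ?, ?)",
            (entity_id, column, position),
        ).lastrowid
    return entity_id, attribute_ids


def add_attribute_format(conn, attribute_id, source_value, target_value):
    """Register a value transformation for a column that needs one."""
    conn.execute(
        "INSERT INTO AttributeFormat (AttributeId, SourceValue, TargetValue) "
        "VALUES (?, ?, ?)",
        (attribute_id, source_value, target_value),
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

provider_id = get_or_add_provider(conn, "SalesSystem")  # sample name
entity_id, attribute_ids = add_entity(
    conn, provider_id, "Orders", ["OrderId", "Shipped"]
)
# Only the "Shipped" column needs a transformation in this example.
add_attribute_format(conn, attribute_ids["Shipped"], "Y", "true")
conn.commit()
```

The order of the calls mirrors the list: the provider comes first because the entity refers to it, and attributes refer to the entity. Calling `get_or_add_provider` a second time with the same name returns the existing identifier rather than creating a duplicate.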
Data movement orchestration¶
The section Asset flow into the platform describes the complete flow of an asset from the data source to the Data Lake. Many of the data movements in that journey are already implemented and orchestrated by Sidra. Others need to be configured, but Sidra provides accelerators to ease the task. The steps to configure and orchestrate the data movements are:
- Review the methods to add an asset to the platform and select the most appropriate one for the requirements. The most common scenario is using Data Factory and internal pipelines.
- If internal pipelines are used, the new pipeline must be configured.
- Finally, associate the entity with the Data Factory pipelines: not only the newly created pipeline but also the FileIngestionDatabricks pipeline, which is used by all the methods for adding assets. A separate section describes how to associate an entity with a pipeline.
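The association rule in the last step can be modelled with a small check. This sketch is not Sidra's API: the set of (entity, pipeline) pairs, the helper names `associate_entity` and `missing_pipelines`, and the sample names `Orders` and `LoadOrders` are all hypothetical. Only the pipeline name `FileIngestionDatabricks` comes from the text above.

```python
# Hypothetical in-memory model of entity-to-pipeline associations.
COMMON_PIPELINE = "FileIngestionDatabricks"  # pipeline name taken from the text


def associate_entity(associations, entity, new_pipeline):
    """Link the entity to its own pipeline and to the common ingestion pipeline."""
    associations.add((entity, new_pipeline))
    associations.add((entity, COMMON_PIPELINE))
    return associations


def missing_pipelines(associations, entity, required):
    """Return the required pipelines the entity is not yet associated with."""
    linked = {pipeline for name, pipeline in associations if name == entity}
    return sorted(set(required) - linked)


associations = set()
associate_entity(associations, "Orders", "LoadOrders")  # sample names

# An empty list means the entity is linked to every required pipeline.
gaps = missing_pipelines(associations, "Orders", ["LoadOrders", COMMON_PIPELINE])
```

A check like `missing_pipelines` is a cheap way to catch the easy mistake of associating the entity with the new pipeline while forgetting the common one.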