Data Factory Metadata¶
The movement of data in Sidra platform is performed by Azure Data Factory (ADF). Every object in ADF is defined using a JSON structure that can be used along with the API and SDKs provided by ADF to deploy pipelines programmatically.
Sidra Service stores the information that defines a pipeline in the Sidra Database, by using a series of pipeline templates. Those ADF component templates are the JSON structures that define the component, but containing placeholders. When the placeholders are resolved with actual values, the resulting JSON structure can be used to create the Data Factory components in ADF.
Sidra data platform takes advantage of this capability and provides a metadata system for storing those JSON structures and a component to automatically deploy those pipelines in ADF:
- The metadata system is composed by a set of tables in the metadata database that are generically called Data Factory metadata, located in the DataIngestion schema of the metadata database.*
- Data Factory Manager is the internal Sidra component (the user does not need to configure anything for this), that uses the Data Factory metadata and programmatically deploys the ADF objects defined in it.
This system automates the deploying of pipelines so they can be easily setup in any environment -for example deploy to a new test environment- at any time given. The idea is not only to reproduce the environment configuration in Azure Data Factory, but also being able to add new pipelines by adding information to the metadata.
The Data Factory metadata is organized using templates for each of the ADF components -activities, datasets, triggers, pipelines. This facilitates the creation of new pipelines by reusing the ones already created.
Metadata and Data Factory Association¶
On the one hand, the Sidra's Platform keeps some metadata information about the files ingested in the Data Lake. That metadata includes the association of every file to an Entity. On the other hand, each Entity is associated with a pipeline by means of the EntityPipeline
table. At last, the Pipeline
table contains a reference to the Data Factory in which the pipeline will be executed.
By following the previous chain of associations, it can be seen that the files and theirs metadata are associated to a specific Data Factory.