Data movement and orchestration in Data Products¶
The Data Product solutions use Azure Data Factory V2 (ADF) to perform data movements, just as the Sidra Service solutions do, but the deployment project creates its own instance of ADF in the Resource Group of the Data Product.
When working with both solutions at the same time (Sidra Service and Data Products), it is important to differentiate the ADF instances. More specifically, there will be:
- an ADF instance for each Data Storage Unit (DSU) in Sidra Service.
- an ADF instance for each Data Product solution.
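To tell the instances apart at a glance, the factories can be enumerated per resource group. The following is a minimal sketch, assuming the azure-identity and azure-mgmt-datafactory Python packages; the subscription ID is a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Group every factory in the subscription by resource group, which separates
# the DSU factories from the ones deployed with each Data Product.
for factory in client.factories.list():
    # factory.id has the form:
    # /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.DataFactory/factories/<name>
    resource_group = factory.id.split("/")[4]
    print(f"{resource_group}: {factory.name}")
```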
Understanding the synchronization between Sidra Service and Data Products¶
Sidra provides templates to create Data Products that can be configured to automatically retrieve the information from Data Storage Units (DSUs).
The process of discovering and extracting this information is called Data Sync, that is, the data synchronization between a Data Product and Sidra Service.
Sidra provides a component that orchestrates this synchronization: the Sync job. This job is deployed in every Data Product created with the template provided by Sidra. Without any additional configuration, the metadata in the Sidra Service database is kept synchronized with the metadata in the Data Product database.
The synchronization will take place as long as the Data Product has the right permissions to synchronize with the data in the DSU.
The Sync job is configured to run every 2 minutes.
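Conceptually, this cadence amounts to a fixed 2-minute tick. The loop below is purely illustrative (the actual Sync job is a deployed component, not this code); it sketches a pass that compensates for its own run time:

```python
import time

SYNC_INTERVAL_SECONDS = 120  # the documented cadence: every 2 minutes

def run_sync_forever(sync_once):
    """Call the sync routine on a fixed 2-minute tick, subtracting the time
    each pass took so the schedule does not drift."""
    while True:
        started = time.monotonic()
        sync_once()
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, SYNC_INTERVAL_SECONDS - elapsed))
```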
Sync job¶
The Sync job uses the Sidra API to retrieve the latest information from the metadata tables. The metadata returned by the API is conditioned/limited by the permissions granted to the Data Product.
Based on the metadata received from the API, for each metadata table, the job updates its Data Product metadata tables. For the Provider and Entity tables, any entry that is no longer available in Sidra Service is set as disabled in the Data Product using the IsDisabled field.
All this information will be used to decide whether there is new content in the Data Lake to be imported into the Data Product.
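As a rough illustration of this flow, the sketch below pulls Provider and Entity metadata and disables local entries that are no longer visible. The endpoint paths and the local_db helper are hypothetical, not the actual Sidra API surface:

```python
import requests

SIDRA_API = "https://<sidra-service>/api"  # placeholder base URL

def sync_metadata(token, local_db):
    """One hypothetical Sync pass over the Provider and Entity metadata."""
    headers = {"Authorization": f"Bearer {token}"}
    for table in ("providers", "entities"):
        # The API only returns the rows this Data Product has permission to see.
        rows = requests.get(f"{SIDRA_API}/metadata/{table}", headers=headers).json()
        remote_ids = {row["id"] for row in rows}
        local_db.upsert(table, rows)
        # Entries that disappeared from Sidra Service are kept locally but disabled.
        for row in local_db.all(table):
            if row["id"] not in remote_ids:
                local_db.set_field(table, row["id"], "IsDisabled", True)
```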
In addition to this, the Sync job is also responsible for executing the defined pipelines, depending on the sync behavior defined for each pipeline (see PipelineSyncBehavior, described here).
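To make the dispatch idea concrete, here is an illustrative sketch. The behavior names are hypothetical placeholders, not the actual PipelineSyncBehavior values, which are listed in the Sidra documentation:

```python
def maybe_trigger(pipeline, trigger_pipeline):
    """Dispatch a pipeline run based on its configured sync behavior.
    Behavior names below are hypothetical placeholders."""
    behavior = pipeline["sync_behavior"]
    if behavior == "IGNORED":              # hypothetical: Sync never triggers it
        return
    if behavior == "RUN_EVERY_SYNC":       # hypothetical: trigger on every pass
        trigger_pipeline(pipeline)
    elif behavior == "RUN_ON_NEW_ASSETS":  # hypothetical: only when Assets are pending
        if pipeline["pending_assets"]:
            trigger_pipeline(pipeline)
```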
Dummy Assets¶
Sidra Service has the concept of dummy Assets: Assets of zero length that are created when an incremental load in Sidra Service finishes but results in no new increment of data. This concept was introduced to force the presence of a new Asset in the Data Product metadata tables. Without these Assets, the data synchronization would not be triggered if the Assets are configured as mandatory Assets (see below for more information on this point). If these Assets are mandatory but not generated, the data movement would not happen, which could affect the business logic of the Data Product. The generation of dummy Assets in data ingestion pipelines is optional, with the default set to false. Therefore, if the data processing logic of a Data Product needs these Assets to be generated, ensure that this parameter is set to true when deploying data intake pipelines.
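The sketch below shows why dummy Assets matter for mandatory-Asset checks: a zero-length Asset still counts as a new Asset, so the synchronization is not blocked. The helper and the Asset shape are hypothetical:

```python
def mandatory_entities_ready(assets, mandatory_entity_ids):
    """An Entity counts as 'ready' as soon as at least one Asset exists for it,
    even a zero-length dummy Asset produced by an empty incremental load."""
    entities_with_assets = {asset["entity_id"] for asset in assets}
    return all(eid in entities_with_assets for eid in mandatory_entity_ids)

# Example: the dummy Asset (size 0) for entity 2 is enough to unblock the load.
assets = [{"entity_id": 1, "size_bytes": 5_242_880},
          {"entity_id": 2, "size_bytes": 0}]  # dummy Asset
print(mandatory_entities_ready(assets, mandatory_entity_ids={1, 2}))  # True
```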
Extracting the new content from the Data Storage Unit¶
The Sidra Data Products use ADF for the data movement; in particular, they use Sync Modes (pipelines) for the extraction of new content from the Data Lake (more specifically, from the DSU).
The actions performed by the extraction pipelines depend on what is going to be done with the content after the extraction. This logic is Data Product-specific and tied to business rule transformations. Some examples of the actions that extraction pipelines may execute are listed below (a conceptual sketch follows the list):
- The content may be used to populate a Data Warehouse inside the Data Product. In this case, the content is first stored in staging tables in the Data Product after the extraction.
- The content may optionally be transformed through the execution of data transformations and business rules within the Data Product.
- Optionally, this transformed data can be re-ingested as rich data back into the Data Lake. In this case, after the extraction and transformation, the new content is pushed to the landing zone in Sidra Service for re-ingestion and configured like any new data Provider for Sidra.
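Putting the three optional stages together, a Data Product's extraction flow can be outlined as follows. Every helper name here is hypothetical; real Data Products implement these steps as ADF activities plus product-specific code:

```python
def run_extraction(pipeline, staging, transform, landing_zone):
    """Conceptual outline of the optional stages described above."""
    new_assets = pipeline.extract_from_dsu()  # copy new content out of the DSU
    staging.load(new_assets)                  # 1. land it in staging tables
    enriched = transform(staging)             # 2. apply business rules
    if enriched is not None:
        landing_zone.push(enriched)           # 3. re-ingest rich data via the landing zone
```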
More information about Data Product pipelines can be found here.