Data Mesh Data Product¶
Sidra provides a specific template for a Data Product that manages and hosts relational modelling of data according to business rules, and exposes this data through an included Azure Search resource. In addition, the different database resources for this Data Product are provided as part of a SQL Elastic Pool, which makes it possible to scale up while saving costs.
This Data Product template is called Data Mesh Data Product.
For more information on SQL Elastic Pools in Azure, see the Azure Elastic Pool documentation.
For more information on Azure Search, see the Azure Search Service documentation.
Purpose¶
The Data Mesh Data Product template is based on the Basic SQL template, but adds Azure Search and SQL Elastic Pool services, with two databases by default: one with the Sidra metadata and another for the client data.
The components for this Data Product template are:

- SQL Elastic Pool, which can be scaled up or down automatically using the appropriate ADF pipeline template (see the Azure CLI sketch after this list) and hosts two databases:
    - The Sidra database, hosting a reduced copy of the metadata tables in Sidra Service, in order to track Assets metadata as well as Data Factory metadata and configuration.
    - The Data Mesh database, hosting the staging tables and the relational model. These transformed data models are exposed via APIs and Azure Search.
- Azure Search, which is aimed at defining deep searches over the client database data.
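As an illustration, the following is a hedged Azure CLI sketch of scaling the Elastic Pool manually; in Sidra the scaling is normally driven by the ADF pipeline template mentioned above, and all resource names here are placeholders:

```bash
# Hypothetical resource names; in Sidra, scaling is normally performed
# by the ADF pipeline template, so this is only a manual equivalent.
az sql elastic-pool update \
  --resource-group <data-product-rg> \
  --server <sql-server-name> \
  --name <elastic-pool-name> \
  --capacity 100
```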
This document refers to key concepts of a Data Product in Sidra, which can be reviewed here.
This Data Product template accelerates the creation of a Data Product by abstracting all the main synchronization elements with Sidra Service. As long as the Data Product created with this template is configured with the required permissions to access the DSU data, the application transparently and automatically retrieves data from the DSU into the staging tables.
This Data Product, integrated with Sidra, shares the common security model with Sidra Service and uses Identity Server for authentication. A copy of the relevant ingested Assets metadata is always kept synchronized with Sidra Service. The metadata synchronization is performed by an automated Sync job, explained here.
The actual data flow orchestration is performed via a specific instance of Azure Data Factory installed in the Data Product.
High-level installation details¶
As with any other type of Data Product in Sidra, the process of installing this Data Product consists of the following main steps:
- A dotnet template is installed, which launches a build and release pipeline in Azure DevOps defined for this Data Product.
- As part of the build and release pipeline for this Data Product, the needed infrastructure is installed. This includes the execution of the Deploy.ps1 deployment script, as well as the deployment of the different WebJobs.
This Data Product is configured once per environment by using a key-value list of elements, called Azure DevOps variables, grouped in a variable group. There is one variable group in DevOps per environment.
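For illustration only, a variable group for a given environment could be created with the Azure DevOps CLI; the group and variable names below are placeholders, not the actual keys used by Sidra:

```bash
# Hypothetical sketch using the azure-devops CLI extension; variable
# names are placeholders, not the actual Sidra configuration keys.
az pipelines variable-group create \
  --organization https://dev.azure.com/<org> \
  --project <project> \
  --name <variable-group-dev> \
  --variables Environment=dev ResourceGroupName=<rg-name>
```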
In order to build and deploy the Data Product, it is enough to execute the Sidra.App.DataMesh pipeline in the desired Git branch, depending on the environment (e.g., dev, test or prod). The Azure DevOps pipeline executes two actions (see the command-line example after this list):
- Application build
- Deployment of the Azure resources
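As an example, the pipeline can also be queued from the command line with the Azure DevOps CLI; organization and project values are placeholders:

```bash
# Queue the Sidra.App.DataMesh pipeline on the dev branch;
# <org> and <project> are placeholders for your own values.
az pipelines run \
  --organization https://dev.azure.com/<org> \
  --project <project> \
  --name Sidra.App.DataMesh \
  --branch dev
```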
Build and release are performed with multi-stage pipelines, so, by default, no manual intervention is required once the template is installed. For more information on these topics, you can access this documentation and this tutorial.
Architecture¶
The Data Mesh Data Product resources are contained in a single resource group, separated from the Sidra Service and DSU resource groups. The services included in the ARM template for this Data Product are the following:
- Storage account for raw data: used for storing the copy of the data that is extracted from the DSU and to which the Data Product has access.
- Data Factory: used for data orchestration pipelines to bring the data from the DSU.
- Elastic Pool: used for sharing resources between the Data Product’s databases:
    - Sidra Database: used for keeping a synchronized copy of the Assets metadata between Sidra Service and the Data Product.
    - Client Database: used for hosting the relational models and the transformation stored procedures.
- Key Vault: used for storing and accessing secrets in a secure way.
- Azure Search: used for defining deep searches over the client database data.
Also, this template includes the possibility to deploy logins/users automatically. More specifically, the role datameshaccess and the login/user FWPUSER have been added. The passwords are retrieved from the Data Product Key Vault, so they need to be included there so that the logins can be successfully deployed. The secrets related to the passwords for SQL logins must follow a specific naming structure.
In case the secret is not manually created, the deployment in Azure DevOps will use an initial, invalid value. The SQL login will not be created while this initial value is present, but only when the latest version of that value differs from the initial one. It will be required to add the secret in the Key Vault for every existing environment where this Data Product is to be deployed, as in the sketch below.
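A minimal sketch of creating such a secret with the Azure CLI; the vault and secret names are placeholders, and the actual secret name must follow the naming structure mentioned above:

```bash
# Placeholder names: use the Data Product Key Vault and the secret name
# dictated by the naming structure for SQL login passwords.
az keyvault secret set \
  --vault-name <data-product-key-vault> \
  --name <sql-login-password-secret> \
  --value '<strong-password>'
```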
Besides the Azure infrastructure deployed, several WebJobs, the same ones used by the Basic SQL and Databricks Data Product templates, are also deployed, responsible for the background tasks of data and metadata synchronization:
- Sync
- DatabaseBuilder
- DatafactoryManager
In order to copy the data from the Entities in the DSU, the Data Product needs to request a token from the Identity Server service. This token will only be valid for a restricted time frame. If the validity period of such tokens needs to be extended, this setting can be configured in the Identity Server database in the Sidra Service resource group. In order to do this, increase the validity period of the tokens as follows:
Apply an UPDATE over the table [dbo].[Clients], extending the value of the field [AccessTokenLifetime] for the Data Mesh client. For example, to extend it to 5 hours:
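A minimal sketch of such an update, assuming the standard Identity Server schema where [AccessTokenLifetime] is expressed in seconds (5 hours = 18000); the WHERE filter is a hypothetical way of locating the Data Mesh client row, so verify the actual ClientId value in your deployment:

```sql
-- Minimal sketch: [AccessTokenLifetime] is expressed in seconds, so 5 hours = 18000.
-- The WHERE filter assumes the client registered for this Data Product can be
-- identified by a ClientId containing "DataMesh"; verify the actual value first.
UPDATE [dbo].[Clients]
SET [AccessTokenLifetime] = 18000
WHERE [ClientId] LIKE '%DataMesh%';
```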
Data Product pipelines¶
Section Data Product pipelines includes information on the available pipelines to be used with this Data Product. See Default pipeline template for extraction for details on this pipeline, its parameters, etc.