Data Domain

To enable the transformation and modelling of relational data according to business rules, Sidra provides a Data Product template named Data Domain. The Data Domain Data Product includes data synchronization towards Databricks and Azure SQL:

A Databricks cluster. By default, it stores the data from Sidra in merged form. The data can also be processed there according to the business rules, either to create the final production tables or to prepare the data for transfer to the staging area of the Azure SQL database.
An Azure SQL database. It stores the metadata tables from Sidra, and also contains the staging tables and the stored procedures that generate the final production tables after processing.

Both approaches use a designated file to apply the business logic to the data: a notebook in Databricks and a stored procedure in Azure SQL.

Purpose

This Data Product template accelerates the creation of a Data Domain by abstracting away the main elements of synchronization with Sidra Service.

The Data Domain needs to be configured to have the required permissions to access the DSU data.

The actual data flow orchestration is performed by a Data Sync, via a specific instance of Azure Data Factory installed in the Data Product resource group. The end-to-end process looks like the following:

A copy of the relevant data is stored in raw form in the Data Product storage and as delta tables in the Data Product Databricks.
A Databricks notebook executes a set of configured queries to create the staging tables in the Data Product database.
Databricks and Azure SQL both transform the data by applying business rules to produce the final dataset intended for use in the Data Domain.
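
The three stages above can be pictured with a minimal, purely illustrative Python sketch that uses in-memory stand-ins. None of the function names, the sample rows, or the "business rule" below come from Sidra, Databricks, or Azure SQL; they are assumptions chosen only to show the order of the stages (merge into a delta-like table, select rows into staging, apply a rule to get the final dataset).

```python
# Illustrative only: in-memory stand-ins for the three stages described above.
# These helpers are hypothetical and are NOT Sidra, Databricks, or Azure SQL APIs.

def merge_into_delta(raw_rows, delta_table, key):
    """Stage 1: merge raw rows into a 'delta table' (a dict keyed by `key`)."""
    for row in raw_rows:
        delta_table[row[key]] = row  # upsert: the latest row per key wins
    return delta_table


def build_staging(delta_table, query):
    """Stage 2: run a configured 'query' (here, a predicate) to fill staging."""
    return [row for row in delta_table.values() if query(row)]


def apply_business_rules(staging_rows, rule):
    """Stage 3: transform staging rows into the final dataset."""
    return [rule(row) for row in staging_rows]


# Hypothetical sample data: two versions of id 1, one row for id 2.
raw = [
    {"id": 1, "amount": 10, "country": "ES"},
    {"id": 2, "amount": 5, "country": "UK"},
    {"id": 1, "amount": 12, "country": "ES"},
]

delta = merge_into_delta(raw, {}, key="id")
staging = build_staging(delta, query=lambda r: r["country"] == "ES")
final = apply_business_rules(
    staging, rule=lambda r: {**r, "amount_cents": r["amount"] * 100}
)
print(final)
```

In a real Data Domain, the merge would be done on delta tables in Databricks and the rule would live in a notebook or a stored procedure, as described above; the sketch only mirrors the sequence.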

Step-by-step

Configure permissions for Data Product to DSU data

For more information, check the specific tutorial for configuring permissions in Data Product to access the DSU data.

Configure Data Sync

For more information, check the specific tutorial for configuring Data Sync through Sidra Web.

Deploy a Data Product

For more information, check the specific tutorial for deploying a Data Domain Data Product.

Dive deeper into Sync Mode and the configuration of staging tables

Architecture

The Data Domain Data Product resources are contained in two resource groups:

One is created by the Data Domain itself and contains the following pieces:
- Storage account for raw data: used for storing the copy of the data extracted from the DSU to which the Data Product has access.
- Storage account for delta tables: used for storing the delta tables, used as external tables by Databricks. This is an ADLS Gen 2 account.
- Data Factory: used for data synchronization, retrieving data from the DSU. It executes the Databricks Orchestrator notebook when necessary, copying the required data to the staging tables, and, if applicable, runs the Orchestrator stored procedure.
- Azure SQL Database: used for keeping a synchronized copy of the Assets metadata between Sidra and the Data Product, and for hosting the relational models, transformation queries and stored procedures.
- Key Vault: used for storing and accessing secrets in a secure way.
- Container App: used for the Data Product API. It includes a container job that executes the synchronization from the DSU(s) to the Data Product.
- Databricks: used for storing optimized delta tables, which contain data processed from raw storage.
One managed resource group, which is created automatically with each Databricks resource. Its name starts with the name of the resource group above, followed by a suffix that starts with -dsu and ends with the name of the Databricks resource created in that resource group. The managed resource group contains a number of resources, such as virtual machines, disks and network interfaces, all managed by the Databricks resource.
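
As a rough illustration of the naming rule for the managed resource group, the following Python sketch checks whether a name has the described shape: the Data Product resource group name, then a suffix that starts with -dsu and ends with the Databricks resource name. The function name and the sample names are hypothetical, and anything that may appear between -dsu and the Databricks name is deliberately left unconstrained because the description does not specify it.

```python
def looks_like_managed_rg(name, product_rg, databricks_name):
    """Return True if `name` matches the described naming shape.

    Shape checked: <product_rg> + "-dsu" + <anything> + <databricks_name>.
    This is an illustrative check, not a Sidra or Azure API.
    """
    prefix = product_rg + "-dsu"
    return (
        name.startswith(prefix)
        and name.endswith(databricks_name)
        # Guard against the prefix and the ending overlapping each other.
        and len(name) >= len(prefix) + len(databricks_name)
    )


# Hypothetical names, for illustration only.
print(looks_like_managed_rg("contoso-dp-dsu-contoso-dbw", "contoso-dp", "contoso-dbw"))
print(looks_like_managed_rg("other-rg-dsu-contoso-dbw", "contoso-dp", "contoso-dbw"))
```

A check like this can help when locating the managed resource group in a subscription with many resource groups, but the authoritative name is always the one shown on the Databricks resource itself.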

Subnets validation

The validation process for a Data Product differs depending on the VNet configuration used.

Default VNet: When using Sidra's default VNet, the required Azure resources are created automatically following a naming convention.
Injected VNet: When Sidra was installed in an existing VNet, the subnets named by the user must be created beforehand. The validation process ensures that the subnets already exist, but it does not check the network resource groups or the configuration used.
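
For the injected VNet case, the kind of existence check described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not Sidra's actual validation: the function name and the subnet names are hypothetical, the list of existing subnets is assumed to have been obtained elsewhere, and the comparison ignores case on the assumption that Azure resource names are matched case-insensitively.

```python
def missing_subnets(required, existing):
    """Return the required subnet names that are not present in `existing`.

    Only existence by name is checked, mirroring the description above:
    resource groups and subnet configuration are not inspected.
    """
    # Assumption: names are compared case-insensitively.
    existing_names = {name.casefold() for name in existing}
    return [name for name in required if name.casefold() not in existing_names]


# Hypothetical subnet names, for illustration only.
required = ["dp-databricks-public", "dp-databricks-private", "dp-containerapp"]
existing = ["DP-Databricks-Public", "dp-databricks-private", "default"]

print(missing_subnets(required, existing))
```

An empty result would mean that every required subnet already exists. A non-empty result lists the subnets that would still have to be created before the deployment.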