Sidra Data Products main concepts¶
Data Products are the pieces of Sidra that drive business cases. The Data Products layer can also be described as the transformation and serving layer of the Sidra Data Platform.
Some characteristics of the Data Products are:
- Data Products (or Apps) are sets of Azure resources and code, enclosed in a Resource Group, which either access the data from one or multiple Data Storage Units via the secure APIs, or retrieve the data from the Data Lake, applying business transformations if needed.
- Any actor that needs to access the data stored in the Data Lake for a specific business need is catalogued as a Data Product.
- Data Products consume content from the Data Storage Units, so they need to know when new content has been ingested in order to extract it and incorporate it into the Data Product storage.
- The architecture allows Data Products to be built with any set of tools and components, so they can range from Power BI-enabled analytical workspaces to full-blown Web apps.
- All Data Products must use the same security model, notifications, logging infrastructure, etc., via the Sidra APIs.
- This centralization and templatization of governance and transversal components is what makes Sidra Data Platform an accelerator for innovation when serving data use cases.
Versions
In some places, the terminology Consumer apps may still be found when referring to Data Products. This term is now deprecated.
Overview of the Data Product creation and deployment processes¶
Sidra Data Product pipelines are designed to perform installation and deployment of the Data Products in a highly automated way, following a Continuous Integration/Continuous Deployment (CI/CD) process.
The general process is as follows:
- Step 1: We need to download the corresponding Data Product template locally.
- Step 2: The source for the actual instance of the Data Product needs to be created from this template.
- Step 3: Once the .NET solution has been obtained with its code projects, we need to push this code to a specific branch (depending on the environment) of an Azure DevOps repository. The repository and branches must be created beforehand if they do not exist yet.
- Step 4: Finally, the solution will be deployed with CI/CD pipelines.
- For this, the Data Product generates a build+release definition in YAML format.
- This YAML definition will be used by Azure DevOps to configure the integration and deployment processes (see the sketch after this list).
- A set of artifacts will be required for this, which may vary according to the type of Data Product template used. These artifacts mainly contain the parameters required for creating and deploying the infrastructure of the Data Product in Azure. There are two main alternatives:
- Data configuration files (.psd1)
- Variable groups in DevOps, the recommended option when creating new templates, as it is compatible with the plugins approach for Data Products.
More details on the plugins approach can be found in the Connectors documentation pages.
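As a reference, below is a minimal, hypothetical sketch of what such a YAML definition could look like. The stage layout, variable group name and steps are illustrative assumptions, not the output of an actual Sidra template:

```yaml
# Illustrative sketch of a build+release definition; the actual YAML
# generated by the Data Product template will differ.
trigger:
  branches:
    include:
      - main   # branch name is an assumption; it depends on the target environment

variables:
  - group: DataProduct-VariableGroup   # hypothetical variable group holding deployment parameters

stages:
  - stage: Build
    jobs:
      - job: BuildSolution
        pool:
          vmImage: 'windows-latest'
        steps:
          - task: DotNetCoreCLI@2      # builds the .NET solution created from the template
            inputs:
              command: 'build'
  - stage: Deploy
    dependsOn: Build
    jobs:
      - job: DeployInfrastructure
        pool:
          vmImage: 'windows-latest'
        steps:
          - script: echo "Deploy the Azure resources using the configured parameters"
```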
Step-by-step
Create a Data Product from scratch
For more information, check the specific tutorial for creating a Data Product.
Data movement and orchestration in Data Products¶
The Data Product solutions use Azure Data Factory V2 (ADF) to perform data movements, the same as the Sidra Service solutions, but the deployment project creates its own ADF instance in the Resource Group of the Data Product.
When working with both solutions at the same time, Sidra Service and Data Products, it is important to differentiate between the ADF instances. More specifically, there will be:
- an ADF instance for each Data Storage Unit (DSU) in Sidra Service.
- an ADF instance for each Data Product solution.
Job DataFactoryManager for Data Products¶
ADF components (datasets, triggers and pipelines) in Data Products are managed the same way as in Sidra Service, by means of the `DataFactoryManager` webjob. The section Data Factory tables explains how it works in Sidra Service.
The `DataFactoryManager` job uses information stored in the metadata database to build and programmatically create the ADF components in Data Factory, which means that Data Products need a metadata database to store the information about ADF components.
There are some minor differences between `DataFactoryManager` for Data Products and for Sidra Service:
- The Sidra Service version includes the creation of the landing zones, that is, the Azure Storage containers, for the Data Storage Units.
- The Pipeline table in Sidra Service can store ADF pipelines but also Azure Search pipelines. In Data Products, the Pipeline table only stores ADF pipelines. `DataFactoryManager` in Data Products has to filter the pipelines in order to create only those needed for ADF.
Understanding the synchronization between Sidra Service and Data Products¶
Sidra provides templates to create Data Products that can be configured to automatically retrieve the information from a Data Storage Unit (DSU) when it is available.
The process of discovering and extracting this information is called synchronization between a Data Product and Sidra Service.
Sidra provides a component that orchestrates this synchronization: the `Sync` webjob.
This job is deployed with the template provided by Sidra in all Data Products. Without any additional configuration, the metadata in the Sidra Service database is already synchronized with the metadata in the Data Product database.
The synchronization will take place as long as the Data Product has the right permissions to synchronize with the data in the DSU. This is done by editing the Balea Authorization permissions tables (Users > Data Product Subject > Permissions).
The synchronization webjob is configured to be executed every 2 minutes.
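For reference, if the `Sync` job were deployed as a scheduled (triggered) Azure WebJob, which is an assumption here rather than something stated above, a 2-minute schedule would typically be declared with a CRON expression in a settings.job file:

```json
{
  "schedule": "0 */2 * * * *"
}
```

The expression uses the six-field NCRONTAB format (seconds first), so "0 */2 * * * *" fires at second 0 of every second minute.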
How naming conventions for Data Product staging tables work¶
Sidra supports two different naming conventions for the Databricks and the Data Product staging tables:
- The Sidra default naming convention. This naming convention is "databasename_schemaname_tablename". For example, a table named "table1" in the origin system, under a database "databaseA" and a schema "schemaA", will result in the name "databaseA_schemaA_table1".
- A custom naming convention through a couple of configuration parameters.
Please find below instructions on how to use these configuration parameters to specify this custom naming convention.
Steps of configuration for naming convention¶
Depending on the metadata scenario, different configuration options are possible, as shown below:
- For existing metadata that does not need to be updated (and no new metadata added): no action is required.
- For existing metadata that needs to be updated, or new metadata to be added, while maintaining the current names (that is, not using the Sidra default naming convention): you need to enable the Sidra custom naming convention. For this, set the parameter `EnableCustomTableNamePrefix` to `true`, and also set the parameter `CustomTableNamePrefix` accordingly, with the prefix you want to use. The prefix is anything in the name that goes before the name of the table. For example:
    - If your staging table name is "tableA", the prefix needs to be null.
    - If your staging table is "schema_name.table_name_", then the prefix needs to be set to the placeholder "{SCHEMA_NAME}_".
    - If your staging table is "database_name.table_name", then the prefix needs to be set to the placeholder "{DATABASE_NAME}_".
Examples of use
Independently of the value set for the parameter `CustomTableNamePrefix`, the names will follow the new default naming convention provided by Sidra: "databasename_schemaname_tablename". Then you can use either a fixed value for the naming convention, or a placeholder (a sketch illustrating the resolution logic is included after this list):
- If using a fixed value to concatenate a fixed prefix to the actual table name, you need to populate the field `CustomTableNamePrefix` with a string value. This will concatenate the prefix configured in `CustomTableNamePrefix` to the actual name of the table. For example:
    - If `CustomTableNamePrefix` is empty, the final name for the staging table of a table named "table1" in origin will be "table1".
    - If `CustomTableNamePrefix` = "databaseA_", the final name for the staging table of a table named "table1" in origin will be "databaseA_table1".
    - If `CustomTableNamePrefix` = "schemaA_", the final name for the staging table of a table named "table1" in origin will be "schemaA_table1".
- If using a placeholder value to concatenate a dynamic prefix to the actual table name, you need to populate the field `CustomTableNamePrefix` with a placeholder string value. For example:
    - If `CustomTableNamePrefix` = "{SCHEMA_NAME}_", the final name for the staging table of a table named "table1" in origin, under the schema "schemaA", will be "schemaA_table1".
    - If `CustomTableNamePrefix` = "{DATABASE_NAME}_", the final name for the staging table of a table named "table1" in origin, under the database "databaseA", will be "databaseA_table1".
- For existing metadata that just needs to be updated, you can also use the new Sidra default naming convention: "databasename_schemaname_tablename". In this case, some manual intervention is required:
    - You need to update the stored procedures on the Data Product side that reference these staging tables, as their names will have changed.
    - You also need to consolidate the Databricks tables: new Databricks tables will have been created with this naming convention after the change is applied, so otherwise you would have different tables referencing the same origin data.
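To make the behavior of the two parameters concrete, the following is a minimal illustrative sketch (not Sidra's actual implementation) of how a staging table name could be resolved, assuming the placeholders are substituted by plain string replacement:

```python
# Illustrative sketch only; this is not Sidra code.
def resolve_staging_table_name(table: str, schema: str, database: str,
                               enable_custom_prefix: bool,
                               custom_prefix: str | None) -> str:
    if not enable_custom_prefix:
        # Sidra default naming convention: databasename_schemaname_tablename
        return f"{database}_{schema}_{table}"
    prefix = custom_prefix or ""
    # Placeholders are substituted with the actual schema/database names.
    prefix = prefix.replace("{SCHEMA_NAME}", schema)
    prefix = prefix.replace("{DATABASE_NAME}", database)
    return f"{prefix}{table}"

# Examples matching the cases above:
print(resolve_staging_table_name("table1", "schemaA", "databaseA", False, None))
# -> databaseA_schemaA_table1
print(resolve_staging_table_name("table1", "schemaA", "databaseA", True, "{SCHEMA_NAME}_"))
# -> schemaA_table1
```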
Identity Server token configuration¶
In order to copy the data from the Entities in the DSU, the Data Product needs to request a token from the Identity Server service. This token will only be valid for a restricted time frame. If the validity period of the token needs to be extended, this setting can be configured in the Identity Server database, located in the Sidra Service resource group. In order to do this, we need to increase the validity period of the tokens by doing the following:
Apply an UPDATE over the table `[dbo].[Clients]`, extending the value of the field `[AccessTokenLifetime]` for the respective Data Product. For example, to extend it to 5 hours:
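A minimal sketch of such an UPDATE, assuming `[AccessTokenLifetime]` is expressed in seconds (so 5 hours = 18000):

```sql
-- Illustrative sketch: extend the access token lifetime to 5 hours (18000 seconds).
-- @ClientAppName stands for the ClientName of the corresponding Data Product.
DECLARE @ClientAppName NVARCHAR(200) = N'<data-product-client-name>';

UPDATE [dbo].[Clients]
SET [AccessTokenLifetime] = 18000
WHERE [ClientName] = @ClientAppName;
```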
where `@ClientAppName` is the `ClientName` of the corresponding Data Product.