Basic SQL and Databricks Client Application¶
To enable the transformation and modelling of relational data according to business rules Sidra provides a template of Client Application, called Basic SQL Client Application.
The Basic SQL and Databricks Client Application extends from the Basic SQL Client Application to allow for additional requirements. These additional requirements cater to more advanced scenarios, such as:
- Scenarios where we need to allow custom data querying logic on the raw data lake from the Client Application.
- Scenarios where we need advanced data aggregation and dependencies logic as views or new staging tables in the Client Application.
Sidra provides a built-in Client Application template as an extension of the basic SQL Client Application. This template incorporates a dedicated Databricks cluster for enabling this custom querying logic to create staging tables as e.g. new views.
The main components of this Client Application template are:
- A client SQL database. This database hosts a reduced version of the metadata tables in Sidra Core (in order to track Assets metadata and ADF metadata), as well as the staging tables and stored procedures to create the final elaborated production tables.
- A Databricks cluster that is deployed and configured with an orchestration notebook. This notebook orchestrates the required custom data processing logic, for example, to query for an Entity and its related Entities to create a staging table as an aggregated view, etc.
This document refers to key concepts of a Client Application in Sidra, which can be reviewed here.
This Client Application template allows to accelerate the creation of a Client Application, by abstracting from all the main synchronization elements with Sidra Core.
The Client Application with this template needs to be configured to have the required permissions to access the DSU data.
Once that is happening, the Client Application Sync job, explained here, is responsible to transparently synchronize the metadata between Sidra Core and the Client Application database. This job also triggers the Client Application pipeline defined to synchronize the actual data to the local Databricks cluster and to create the Staging tables.
The actual data flow orchestration is performed by a Client Application pipeline, via a specific instance of Azure Data Factory installed in the Client Application resource group. Related to this orchestrator activity, a new option has been included in the version 1.12 onwards, aimed to manage when the stored procedure is empty, or create a new template without the orchestrator.
The Client Application pipeline is described in section Client Application pipelines. This section includes information on the available Client Application pipeline to be used for this Client Application. See here for details on this pipeline, parameters, etc.
In summary, this Client Application pipeline performs the following actions:
- Export the raw data from the DSU to the Client Application storage (raw storage container).
- Copy the data to the Client Application Databricks as delta tables.
- An orchestrator Databricks notebook executes the configured queries in
StagingConfigurationto create the required staging tables with the needed views or aggregations.
- Create views as new staging tables in the Client Application database.
- An orchestrator SQL stored procedure executes business logic to create the final production tables in the Client Application database.
This Client Application integrated with Sidra shares the common security model with Sidra Core and uses Identity Server for authentication.
In summary, the end-to-end process looks like the following:
- Thanks to the execution of the Sync webjob inside the Client Application, the ingested Assets metadata is synchronized with the Sidra Core Assets metadata.
- A copy of the relevant actual data is stored in the Client Application storage in raw and in delta tables in the Client Application Databricks.
- An orchestrator Databricks notebooks executes a set of configured queries to create the staging tables in the Client Application database.
- Further SQL stored procedures create the final production tables for the relevant use case.
High-level installation details¶
As with any other type of Client Application in Sidra, the process of installing this Client Application consists of the following main steps:
- A dotnet template is installed, which launches a build and release pipeline in Azure DevOps defined for the Data Labs Client Application.
- As part of the build and release pipeline for this Client Application, the needed infrastructure is installed. This includes the execution of the Deploy.ps1 and Databricks deployment scripts, and also the different WebJobs deployment.
Build+Release is performed with multi-stage pipelines. Once the Client Application template is installed, you have to create the solution using that template, push the changes in Git and configure the build/release pipeline in DevOps to start the deployment.
The Basic SQL and Databricks Client Application resources are contained into a couple of resource groups:
- One resource group contains all the services used by the Client Application, separated from the Sidra Core and DSU resource groups:
The services included in the ARM template for this Client Application contain the following pieces:
- Client storage account for raw data: used for storing the copy of the data that is extracted from the DSU, and for which the Client Application has access.
- Client storage account for delta tables: used for storing the delta tables, used as external tables by Databricks. This is an ADLS Gen 2 account.
- Client Data Factory: used for data orchestration pipelines to bring the data from the DSU, execute the Databricks orchestrator notebook, and copying the ellaborated data to the staging tables.
- Client Database: used for keeping a synchronized copy of the Assets metadata between Sidra Core and the Client App, and for hosting the relational models and transformation queries and stored procedures.
- Client Key Vault: used for storing and accessing secrets in a secure way.
One resource group is a resource group (managed resource group) that is created automatically with each Databricks service. The name of this resource group has as a prefix the same name of the above resource group, plus a suffix ending in
-dbr. A number of resources are created inside this managed resource group, such as virtual machines, disks and network interfaces.
Besides the Azure infrastructure, several Webjobs are also deployed for the Basic Client Application, responsible for the background tasks of data and metadata synchronization:
You can find more information about these jobs in this page.
Client Application pipelines¶
New Client Application tables¶
Additionally, this Client Application contains a new table, called
StagingConfiguration, which contains
the configuration of the new table names in Staging that need to be created, as well as the queries to create these new tables.
|IdPipeline||Id of the pipeline for the extraction of the data from the DSU to the Staging tables (Mandatory).|
|Query||Query that defines the way to retrieve the data from the Databricks of the Client Application. If it is not defined, an IdEntity must be specified to retrieve the content by a default query.
|IdEntity||Entity identifier to retrieve the data from the Databricks of the Client Application. By default, it retrieves all Attributes of the Entity. If it is not defined, a
|SchemaName||Schema name used for the staging tables. By default, Staging schema is used.|
|TableName||Table name for the staging table. If it is not defined, a combination with the name of the Provider and the table name of the Entity should be used.|
|TableFieldDefinition||Table field definition for the staging tale, e.g.,
StagingConfiguration table is needed in order to parametrize these queries in the Databricks notebook.
Also, there is another table, called
This table is populated by Sidra with the information of Time Travel provided by Client Databricks when the data
of the assets is inserted in Client Databricks. Please check Databricks documentation for more information on the Time Travel functionality.
|Id||Id of the Asset where Time Travel definition applies.|
|VersionNumber||Version number of the Time Travel given by Databricks.|
|Timestamp||Timestamp of the Time Travel given by Databricks.|
For a detailed example on how to use these tables and the general functioning of the Basic SQL and Databricks Client Application, please check this tutorial.
How to configure the queries to create custom staging tables¶
The Basic SQL and Databricks Client Application uses the
StagingConfiguration table to store the queries that are to be run on the Databricks of the Client Application.
When it comes to configuring the queries for creating the staging tables, we can find two different scenarios:
Scenario 1. Create a staging table for an Entity using the default configuration¶
In this first scenario, we use the default configuration of the
StagingConfiguration table to configure the creation of the staging tables:
By reviewing the above reference table that describes each field, we see that we just need to configure the
Queryfield is NULL, which means that the default query is executed. This default query just retrieves all the Attributes of the given Entity in the
When the default query is executed by the underlying process in the Client Application, the result is the following:
- A staging table is created, with
Stagingas schema. The table name is a combination of the
- This Staging table is populated with data, by executing this query.
Below is an example of a generic query that the underlying process as triggered by the Client Application executes:
<<ASSETID.??are the placeholders that should be filled in by the process, using the list of the Assets that the pipeline is loading.
<<TIMESTAMP.support.support_SalesLT_Product>>: The process locates the timestamp for the most recent Asset represented by the Provider
Supportand the table name
support_SalesLT_Product(configured Entity), to ensure that the data is up to date.
<<ASSETID.support.support_SalesLT_Product>>: The process substitutes this with all Asset identifiers for the Provider
Support, and the table name
support_SalesLT_Product, to filter the data that is loaded by the pipeline.
Once processed, the above query would be translated to this query:
Case 2. Create a custom staging table using the combination of two Entities¶
In the second scenario, we are not using the default configuration mode for the
This means that we need to populate the following fields:
INSERT INTO [Sidra].[StagingConfiguration] ([IdPipeline], [Query], [IdEntity], [SchemaName], [TableName], [TableFieldDefinition]) VALUES , (@IdPipelineDSUExtraction , 'SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId WHERE H.IdSourceItem IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)' , NULL , NULL , 'support_SalesLT_SalesOrderHeaderWithDetails' , '[SalesOrderID] [int] NULL,[OrderDate] [datetime] NULL,[SalesOrderNumber] [nvarchar](25) NULL,[ShipMethod] [nvarchar](50) NULL,[SubTotal] [money] NULL,[SalesOrderDetailID] [int] NULL,[OrderQty] [smallint] NULL,[UnitPrice] [money] NULL,[LineTotal] [numeric](18, 0) NULL')
When this query is executed by the underlying process in the Client Application, the result is the following:
- A staging table is created, with
Stagingas a schema (default when the value
SchemaNameis not specified). The table name is speficied by the value in
TableName, and the columns for the table are defined by the
- The staging table is populated with data, by executing the query defined in the
SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId WHERE H.IdSourceItem IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)
<<ASSETID.??>>are placeholders that should be filled by the process, according the Assets that the pipeline is loading.
- For the creation of custom queries the placeholders available are: