Basic SQL and Databricks Client Application

To enable the transformation and modelling of relational data according to business rules, Sidra provides a Client Application template called the Basic SQL Client Application.

The Basic SQL and Databricks Client Application extends the Basic SQL Client Application to cover additional requirements. These requirements cater to more advanced scenarios, such as:

  • Scenarios where custom data querying logic on the raw data lake needs to be run from the Client Application.
  • Scenarios that require advanced data aggregation and dependency logic, in the form of views or new staging tables, in the Client Application.

Sidra provides a built-in Client Application template as an extension of the Basic SQL Client Application. This template incorporates a dedicated Databricks cluster that enables this custom querying logic to create staging tables, e.g. as new views.

The main components of this Client Application template are:

  • A client SQL database. This database hosts a reduced version of the metadata tables in Sidra Core (in order to track Assets metadata and ADF metadata), as well as the staging tables and stored procedures to create the final elaborated production tables.
  • A Databricks cluster that is deployed and configured with an orchestration notebook. This notebook orchestrates the required custom data processing logic, for example, to query for an Entity and its related Entities to create a staging table as an aggregated view, etc.

This document refers to key concepts of a Client Application in Sidra, which can be reviewed here.

Purpose

This Client Application template accelerates the creation of a Client Application by abstracting away the main synchronization elements with Sidra Core.

A Client Application created from this template needs to be configured with the required permissions to access the DSU data.

Once these permissions are in place, the Client Application Sync job, explained here, is responsible for transparently synchronizing the metadata between Sidra Core and the Client Application database. This job also triggers the Client Application pipeline defined to synchronize the actual data to the local Databricks cluster and to create the staging tables.
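
For illustration, once the Sync job has run, the synchronized metadata can be inspected directly in the Client Application database. The following is a minimal sketch; [Sidra].[Asset] and [Sidra].[Entity] are assumed names for the reduced metadata tables and may differ in your deployment:

-- Hedged sketch: list the most recently synchronized Assets and their Entities.
-- Table names are assumptions for illustration; check your deployed schema.
SELECT TOP (10)
    a.[Id],
    a.[IdEntity],
    e.[TableName]
FROM [Sidra].[Asset] AS a
INNER JOIN [Sidra].[Entity] AS e ON e.[Id] = a.[IdEntity]
ORDER BY a.[Id] DESC;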

The actual data flow orchestration is performed by a Client Application pipeline, via a specific instance of Azure Data Factory installed in the Client Application resource group. Related to this orchestrator activity, from version 1.12 (2022.R2) onwards a new option has been included, aimed at managing the case where the stored procedure is empty, or at creating a new template without the orchestrator.

Info

The Client Application pipeline is described in section Client Application pipelines. This section includes information on the available Client Application pipeline to be used for this Client Application. See here for details on this pipeline, parameters, etc.

Summary

In summary, the Client Application pipeline performs the following actions:

  • Export the raw data from the DSU to the Client Application storage (raw storage container).
  • Copy the data to the Client Application Databricks as delta tables.
  • An orchestrator Databricks notebook executes the configured queries in StagingConfiguration to create the required staging tables with the needed views or aggregations.
  • Create views as new staging tables in the Client Application database.
  • An orchestrator SQL stored procedure executes business logic to create the final production tables in the Client Application database (a minimal sketch of such a procedure is shown after this list).
  • This Client Application, integrated with Sidra, shares the common security model with Sidra Core and uses Identity Server for authentication.
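
As an illustration of the stored procedure step, below is a minimal sketch of the kind of business logic such a procedure might run. The procedure name [Production].[usp_LoadProduct] and the production table are hypothetical, not part of the template; the staging table name follows the naming convention described later in this document:

-- Hedged sketch: merge freshly loaded staging rows into a final production table.
-- All object names here are illustrative, not part of the template.
CREATE OR ALTER PROCEDURE [Production].[usp_LoadProduct]
AS
BEGIN
    SET NOCOUNT ON;
    MERGE [Production].[Product] AS target
    USING [Staging].[Support_support_SalesLT_Product] AS source
        ON target.ProductID = source.ProductID
    WHEN MATCHED THEN
        UPDATE SET target.Name = source.Name, target.LoadDate = source.LoadDate
    WHEN NOT MATCHED THEN
        INSERT (ProductID, Name, LoadDate)
        VALUES (source.ProductID, source.Name, source.LoadDate);
END;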

The end-to-end process looks like the following:

  • Thanks to the execution of the Sync webjob inside the Client Application, the ingested Assets metadata is synchronized with the Sidra Core Assets metadata.
  • A copy of the relevant actual data is stored in the Client Application storage in raw and in delta tables in the Client Application Databricks.
  • An orchestrator Databricks notebook executes a set of configured queries to create the staging tables in the Client Application database.
  • Further SQL stored procedures create the final production tables for the relevant use case.

High-level installation details

As with any other type of Client Application in Sidra, the process of installing this Client Application consists of the following main steps:

  • A dotnet template is installed, which launches a build and release pipeline in Azure DevOps defined for this Client Application.
  • As part of the build and release pipeline for this Client Application, the needed infrastructure is installed. This includes the execution of the Deploy.ps1 and Databricks deployment scripts, as well as the deployment of the different WebJobs.

Build+Release is performed with multi-stage pipelines. Once the Client Application template is installed, you have to create the solution using that template, push the changes to Git, and configure the build/release pipeline in DevOps to start the deployment.

Info

For more information on these topics you can access this Documentation.

Step-by-step

Create a Client App from scratch

For more information, check the specific tutorial for creating a Client App.

Architecture

The Basic SQL and Databricks Client Application resources are contained in two resource groups:

  • One resource group contains all the services used by the Client Application, separated from the Sidra Core and DSU resource groups. The services included in the ARM template for this Client Application are the following:

    • Client storage account for raw data: used for storing the copy of the data that is extracted from the DSU, and for which the Client Application has access.
    • Client storage account for delta tables: used for storing the delta tables, used as external tables by Databricks. This is an ADLS Gen 2 account.
    • Client Data Factory: used for data orchestration pipelines to bring the data from the DSU, execute the Databricks orchestrator notebook, and copy the elaborated data to the staging tables.
    • Client Database: used for keeping a synchronized copy of the Assets metadata between Sidra Core and the Client App, and for hosting the relational models and transformation queries and stored procedures.
    • Client Key Vault: used for storing and accessing secrets in a secure way.
  • The other resource group is a managed resource group that is created automatically with each Databricks service. The name of this resource group has as a prefix the same name as the above resource group, plus a suffix ending in -dbr. A number of resources are created inside this managed resource group, such as virtual machines, disks, and network interfaces.

Figure: Basic-SQL-Databricks architecture.

Besides the Azure infrastructure, several WebJobs are also deployed for this Client Application, responsible for the background tasks of data and metadata synchronization:

  • Sync webjob
  • DatabaseBuilder webjob
  • DatafactoryManager webjob

Info

You can find more information about these jobs in this page.

Client Application pipelines

Section Client Application pipelines includes information on the available Client Application pipelines to be used for this Client Application. See here for details on this pipeline, parameters, etc.

New Client Application tables

Additionally, this Client Application contains a new table, called StagingConfiguration, which holds the configuration of the new table names in Staging that need to be created, as well as the queries to create these new tables.

The columns of this table are described below:

  • IdPipeline: Id of the pipeline for the extraction of the data from the DSU to the staging tables (mandatory).
  • Query: Query that defines the way to retrieve the data from the Databricks of the Client Application. If it is not defined, an IdEntity must be specified so that the content can be retrieved by a default query. IdEntity and Query cannot both be NULL at the same time.
  • IdEntity: Entity identifier used to retrieve the data from the Databricks of the Client Application. By default, all Attributes of the Entity are retrieved. If it is not defined, a Query must be specified to determine the way to retrieve the data. IdEntity and Query cannot both be NULL at the same time.
  • SchemaName: Schema name used for the staging tables. By default, the Staging schema is used.
  • TableName: Table name for the staging table. If it is not defined, a combination of the Provider name and the Entity table name is used.
  • TableFieldDefinition: Table field definition for the staging table, e.g., [SalesOrderID] [int] NULL,[OrderDate] [datetime] NULL,[SalesOrderNumber] [nvarchar](25) NULL,[UnitPrice] [money] NULL. If it is not defined, the whole list of Attributes of the Entity is used. TableFieldDefinition and IdEntity cannot both be NULL at the same time.

The StagingConfiguration table is needed in order to parametrize these queries in the Databricks notebook.
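
For reference, the shape of this table can be sketched as follows. This is a hypothetical definition inferred from the column descriptions above; the exact data types in a real deployment may differ:

-- Hedged sketch of [Sidra].[StagingConfiguration]; data types are assumptions.
CREATE TABLE [Sidra].[StagingConfiguration] (
    [IdPipeline] UNIQUEIDENTIFIER NOT NULL,    -- pipeline extracting from the DSU (mandatory)
    [Query] NVARCHAR(MAX) NULL,                -- custom Databricks query, or NULL for the default
    [IdEntity] INT NULL,                       -- Entity id used by the default query
    [SchemaName] NVARCHAR(128) NULL,           -- defaults to Staging when NULL
    [TableName] NVARCHAR(128) NULL,            -- defaults to ProviderName_EntityTableName when NULL
    [TableFieldDefinition] NVARCHAR(MAX) NULL  -- column definitions for the staging table
);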

Also, there is another table, called AssetTimeTravel. This table is populated by Sidra with the Time Travel information provided by the Client Databricks when the Assets data is inserted into the Client Databricks. Please check the Databricks documentation for more information on the Time Travel functionality.

The columns of this table are described below:

  • Id: Id of the Asset to which the Time Travel definition applies.
  • VersionNumber: Version number of the Time Travel given by Databricks.
  • Timestamp: Timestamp of the Time Travel given by Databricks.
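
As an illustration, the values stored in this table map directly onto Databricks Delta time travel clauses. The following is a minimal sketch, using illustrative identifiers consistent with the examples below:

-- Hedged sketch: look up the Time Travel version recorded for an Asset...
SELECT [VersionNumber], [Timestamp]
FROM [Sidra].[AssetTimeTravel]
WHERE [Id] = 1399; -- illustrative Asset id

-- ...which corresponds to Databricks Delta time travel clauses such as:
-- SELECT * FROM support.support_SalesLT_Product VERSION AS OF 12
-- SELECT * FROM support.support_SalesLT_Product TIMESTAMP AS OF "2021-12-02 11:38:42.0000000"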

For a detailed example on how to use these tables and the general functioning of the Basic SQL and Databricks Client Application, please check this tutorial.

How to configure the queries to create custom staging tables

The Basic SQL and Databricks Client Application uses the StagingConfiguration table to store the queries that are to be run on the Databricks of the Client Application. When it comes to configuring the queries for creating the staging tables, we can find two different scenarios:

Scenario 1. Create a staging table for an Entity using the default configuration

In this first scenario, we use the default configuration of the StagingConfiguration table to configure the creation of the staging tables. By reviewing the column reference above that describes each field, we see that we just need to configure the IdEntity field:

INSERT INTO [Sidra].[StagingConfiguration] ([IdPipeline], [Query], [IdEntity], [SchemaName], [TableName], [TableFieldDefinition]) VALUES
(@IdPipelineDSUExtraction, NULL, @IdEntity, NULL, NULL, NULL)
As you will notice, the Query field is NULL, which means that the default query is executed. This default query just retrieves all the Attributes of the given Entity in the IdEntity field.

When the default query is executed by the underlying process in the Client Application, the result is the following:

  • A staging table is created, with Staging as schema. The table name is a combination of the Provider name and the Entity name, e.g. [Staging].[Support_support_SalesLT_Product].
  • This staging table is populated with data by executing the default query.

Below is an example of a generic query executed by the underlying process, as triggered by the Client Application:

SELECT ProductID,Name,LoadDate,IdSourceItem 
FROM (
SELECT ProductID,Name,LoadDate,IdSourceItem FROM support.support_SalesLT_Product TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_Product>>"
) A WHERE A.IdSourceItem IN (<<ASSETID.support.support_SalesLT_Product>>)

where:

  • <<TIMESTAMP.??>> and <<ASSETID.??>> are the placeholders that should be filled in by the process, using the list of the Assets that the pipeline is loading.
  • <<TIMESTAMP.support.support_SalesLT_Product>>: The process locates the timestamp for the most recent Asset represented by the Provider Support and the table name support_SalesLT_Product (configured Entity), to ensure that the data is up to date.
  • <<ASSETID.support.support_SalesLT_Product>>: The process substitutes this with all Asset identifiers for the Provider Support, and the table name support_SalesLT_Product, to filter the data that is loaded by the pipeline.

Once processed, the above query would be translated to this query:

SELECT ProductID,Name,LoadDate,IdSourceItem 
FROM (
SELECT ProductID,Name,LoadDate,IdSourceItem FROM support.support_SalesLT_Product TIMESTAMP AS OF "2021-12-02 11:38:42.0000000"
) A WHERE A.IdSourceItem IN (1399,1402,1409,1416)

Scenario 2. Create a custom staging table using the combination of two Entities

In the second scenario, we are not using the default configuration mode for the StagingConfiguration table. This means that we need to populate the following fields: Query, TableName and TableFieldDefinition.

INSERT INTO [Sidra].[StagingConfiguration] ([IdPipeline], [Query], [IdEntity], [SchemaName], [TableName], [TableFieldDefinition]) VALUES
(@IdPipelineDSUExtraction
, 'SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal
   FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H
   LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId 
   WHERE H.IdSourceItem IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)'
, NULL 
, NULL
, 'support_SalesLT_SalesOrderHeaderWithDetails'
, '[SalesOrderID] [int] NULL,[OrderDate] [datetime] NULL,[SalesOrderNumber] [nvarchar](25) NULL,[ShipMethod] [nvarchar](50) NULL,[SubTotal] [money] NULL,[SalesOrderDetailID] [int] NULL,[OrderQty] [smallint] NULL,[UnitPrice] [money] NULL,[LineTotal] [numeric](18, 0) NULL')

When this query is executed by the underlying process in the Client Application, the result is the following:

  • A staging table is created, with Staging as schema (the default when the SchemaName value is not specified). The table name is specified by the value in TableName, and the columns for the table are defined by the TableFieldDefinition column.
  • The staging table is populated with data by executing the query defined in the Query column:
SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal 
FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H 
LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId 
WHERE H.IdSourceItem IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)

where:

  • <<TIMESTAMP.??>> and <<ASSETID.??>> are placeholders that should be filled in by the process, according to the Assets that the pipeline is loading.
  • For the creation of custom queries, the available placeholders are <<TIMESTAMP.{provider-databaseName}.{entity-tableName}>> and <<ASSETID.{provider-databaseName}.{entity-tableName}>>.

