How to configure the queries to create custom staging tables¶

The Data Product uses the StagingConfiguration table to store the queries that are to be run on the Databricks of the Data Product. When it comes to configuring the queries for creating the staging tables, we can find two different scenarios:

Scenario 1. Create a staging table for an Entity using the default configuration¶

In this first scenario, we use the default configuration of the StagingConfiguration table to configure the creation of the staging tables: By reviewing the above reference table that describes each field, we see that we just need to configure the IdEntity field:

INSERT INTO [Sidra].[StagingConfiguration] ([IdPipeline], [Query], [IdEntity], [SchemaName], [TableName], [TableFieldDefinition]) VALUES
(@IdPipelineDSUExtraction, NULL, @IdEntity, 'support_SalesLT_Product', NULL, NULL)

As you will notice, the Query field is NULL, which means that the default query is executed. This default query just retrieves all the Attributes of the given Entity in the IdEntity field.

When the default query is executed by the underlying process in the Data Product, the result is the following:

A Databricks table is created under the corresponding Provider database with provided name.
A SQL staging table is created, with Staging as schema. The table name is specified by the value in TableName.
This staging table is populated with data, by executing this query.

When this configuration is created automatically from Sidra's UI, if there is an existing table with the combination of Provider name and Entity name, e.g. [Staging].[Support_support_SalesLT_Product], the Data Sync name will be appended at the end to allow table differentiation.

Below is an example of a generic query that the underlying process as triggered by the Data Product executes:

SELECT ProductID,Name,LoadDate,SidraIdAsset 
FROM (
SELECT ProductID,Name,LoadDate,SidraIdAsset FROM support.support_SalesLT_Product TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_Product>>"
) A WHERE A.SidraIdAsset IN (<<ASSETID.support.support_SalesLT_Product>>)

, where:

<<TIMESTAMP.??>> and <<ASSETID.?? are the placeholders that should be filled in by the process, using the list of the Assets that the pipeline is loading.
<<TIMESTAMP.support.support_SalesLT_Product>>: The process locates the timestamp for the most recent Asset represented by the Provider Support and the table name support_SalesLT_Product (configured Entity), to ensure that the data is up to date.
<<ASSETID.support.support_SalesLT_Product>>: The process substitutes this with all Asset identifiers for the Provider Support, and the table name support_SalesLT_Product, to filter the data that is loaded by the pipeline.

Once processed, the above query would be translated to this query:

SELECT ProductID,Name,LoadDate,SidraIdAsset 
FROM (
SELECT ProductID,Name,LoadDate,SidraIdAsset FROM support.support_SalesLT_Product TIMESTAMP AS OF "2021-12-02 11:38:42.0000000"
) A WHERE A.SidraIdAsset IN (1399,1402,1409,1416)

Scenario 2. Create a custom staging table using the combination of two Entities¶

In the second scenario, we are not using the default configuration mode for the StagingConfiguration table. This means that we need to populate the following fields: Query, TableName and TableFieldDefinition.

INSERT INTO [Sidra].[StagingConfiguration] ([IdPipeline], [Query], [IdEntity], [SchemaName], [TableName], [TableFieldDefinition]) VALUES
, (@IdPipelineDSUExtraction
, 'SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal
   FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H
   LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId 
   WHERE H.SidraIdAsset IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)'
, NULL 
, NULL
, 'support_SalesLT_SalesOrderHeaderWithDetails'
, '[SalesOrderID] [int] NULL,[OrderDate] [datetime] NULL,[SalesOrderNumber] [nvarchar](25) NULL,[ShipMethod] [nvarchar](50) NULL,[SubTotal] [money] NULL,[SalesOrderDetailID] [int] NULL,[OrderQty] [smallint] NULL,[UnitPrice] [money] NULL,[LineTotal] [numeric](18, 0) NULL')

When this query is executed by the underlying process in the Data Product, the result is the following:

A staging table is created, with Staging as a schema (default when the value SchemaName is not specified). The table name is specified by the value in TableName, and the columns for the table are defined by the TableFieldDefinition column.
The staging table is populated with data, by executing the query defined in the Query column:

SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal 
FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H 
LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId 
WHERE H.SidraIdAsset IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)

, where:

<<TIMESTAMP.??>> and <<ASSETID.??>> are placeholders that should be filled by the process, according the Assets that the pipeline is loading.
For the creation of custom queries the placeholders available are: <<TIMESTAMP.{provider-databaseName}.{stagingConfiguration-tableName}>> <<ASSETID.{provider-databaseName}.{stagingConfiguration-tableName}>>