How to configure the queries to create custom staging tables¶
The Data Product uses the StagingConfiguration
table to store the queries that are to be run on the Databricks of the Data Product. When it comes to configuring the queries for creating the staging tables, we can find two different scenarios:
Scenario 1. Create a staging table for an Entity using the default configuration¶
In this first scenario, we use the default configuration of the StagingConfiguration
table to configure the creation of the staging tables: By reviewing the above reference table that describes each field, we see that we just need to configure the IdEntity
field:
INSERT INTO [Sidra].[StagingConfiguration] ([IdPipeline], [Query], [IdEntity], [SchemaName], [TableName], [TableFieldDefinition]) VALUES
(@IdPipelineDSUExtraction, NULL, @IdEntity, 'support_SalesLT_Product', NULL, NULL)
Query
field is NULL, which means that the default query is executed. This default query just retrieves all the Attributes of the given Entity in the IdEntity
field. When the default query is executed by the underlying process in the Data Product, the result is the following:
- A Databricks table is created under the corresponding Provider database with provided name.
- A SQL staging table is created, with
Staging
as schema. The table name is specified by the value inTableName
. - This staging table is populated with data, by executing this query.
When this configuration is created automatically from Sidra's UI, if there is an existing table with the combination of Provider
name and Entity
name, e.g. [Staging].[Support_support_SalesLT_Product]
, the Data Sync name will be appended at the end to allow table differentiation.
Below is an example of a generic query that the underlying process as triggered by the Data Product executes:
SELECT ProductID,Name,LoadDate,SidraIdAsset
FROM (
SELECT ProductID,Name,LoadDate,SidraIdAsset FROM support.support_SalesLT_Product TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_Product>>"
) A WHERE A.SidraIdAsset IN (<<ASSETID.support.support_SalesLT_Product>>)
, where:
<<TIMESTAMP.??>>
and<<ASSETID.??
are the placeholders that should be filled in by the process, using the list of the Assets that the pipeline is loading.<<TIMESTAMP.support.support_SalesLT_Product>>
: The process locates the timestamp for the most recent Asset represented by the ProviderSupport
and the table namesupport_SalesLT_Product
(configured Entity), to ensure that the data is up to date.<<ASSETID.support.support_SalesLT_Product>>
: The process substitutes this with all Asset identifiers for the ProviderSupport
, and the table namesupport_SalesLT_Product
, to filter the data that is loaded by the pipeline.
Once processed, the above query would be translated to this query:
SELECT ProductID,Name,LoadDate,SidraIdAsset
FROM (
SELECT ProductID,Name,LoadDate,SidraIdAsset FROM support.support_SalesLT_Product TIMESTAMP AS OF "2021-12-02 11:38:42.0000000"
) A WHERE A.SidraIdAsset IN (1399,1402,1409,1416)
Scenario 2. Create a custom staging table using the combination of two Entities¶
In the second scenario, we are not using the default configuration mode for the StagingConfiguration
table. This means that we need to populate the following fields: Query
, TableName
and TableFieldDefinition
.
INSERT INTO [Sidra].[StagingConfiguration] ([IdPipeline], [Query], [IdEntity], [SchemaName], [TableName], [TableFieldDefinition]) VALUES
, (@IdPipelineDSUExtraction
, 'SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal
FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H
LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId
WHERE H.SidraIdAsset IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)'
, NULL
, NULL
, 'support_SalesLT_SalesOrderHeaderWithDetails'
, '[SalesOrderID] [int] NULL,[OrderDate] [datetime] NULL,[SalesOrderNumber] [nvarchar](25) NULL,[ShipMethod] [nvarchar](50) NULL,[SubTotal] [money] NULL,[SalesOrderDetailID] [int] NULL,[OrderQty] [smallint] NULL,[UnitPrice] [money] NULL,[LineTotal] [numeric](18, 0) NULL')
When this query is executed by the underlying process in the Data Product, the result is the following:
- A staging table is created, with
Staging
as a schema (default when the valueSchemaName
is not specified). The table name is specified by the value inTableName
, and the columns for the table are defined by theTableFieldDefinition
column. - The staging table is populated with data, by executing the query defined in the
Query
column:
SELECT H.SalesOrderId, H.OrderDate, H.SalesOrderNumber, H.ShipMethod, H.SubTotal, D.SalesOrderDetailId, D.OrderQty, D.UnitPrice, D.LineTotal
FROM (SELECT * FROM support.support_SalesLT_SalesOrderHeader TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderHeader>>") H
LEFT JOIN (SELECT * FROM support.support_SalesLT_SalesOrderDetail TIMESTAMP AS OF "<<TIMESTAMP.support.support_SalesLT_SalesOrderDetail>>") D ON H.SalesOrderId = D.SalesOrderId
WHERE H.SidraIdAsset IN (<<ASSETID.support.support_SalesLT_SalesOrderHeader>>)
, where:
<<TIMESTAMP.??>>
and<<ASSETID.??>>
are placeholders that should be filled by the process, according the Assets that the pipeline is loading.- For the creation of custom queries the placeholders available are:
<<TIMESTAMP.{provider-databaseName}.{stagingConfiguration-tableName}>>
<<ASSETID.{provider-databaseName}.{stagingConfiguration-tableName}>>