
Data Product tables

Many of the tables in the Sidra Data Product are similar in structure and content to those stored in Sidra Service; please see the section on Sidra Service metadata for details. Even so, there are slight differences in certain tables, as well as new ones, aimed at getting the maximum functionality out of Data Products for customer purposes. Below is a summary of the existing tables for Data Products.

| Data Products Tables | Description | Type |
|---|---|---|
| ActivityTemplate | More information can be checked here in Sidra Service. | Data Factory |
| ActivityTemplatePipelineTemplate | | Data Factory |
| Asset | More information can be checked here in Sidra Service. | Metadata |
| AssetStatus | More information can be checked here in Sidra Service. | Metadata |
| AssetTimeTravel | Populated by Sidra with the Time Travel information provided by Client Databricks when the Asset data is inserted in Client Databricks. | Performance |
| Attribute | More information can be checked here in Sidra Service. | Metadata |
| DataFactory | More information can be checked here in Sidra Service. | Data Factory |
| DataSetTemplate | More information can be checked here in Sidra Service. | Data Factory |
| DataSetTemplatePipelineTemplate | More information can be checked here in Sidra Service. | Data Factory |
| DataSource | More information can be checked here in Sidra Service. | Metadata |
| DataSourceTemplate | More information can be checked here in Sidra Service. | Metadata |
| DataSourceTemplatePipelineTemplate | More information can be checked here in Sidra Service. | Metadata |
| Entity | More information can be checked here in Sidra Service. | Metadata |
| EntityPipeline | Allows associating the same pipeline with several Entities. | Metadata |
| ExtractPipelineExecution | Execution tracking table filled with data from Sidra Service, leaving information for the stored procedures to execute. | Data extraction |
| ExtractPipelineExecutionStatus | Reference table enumerating the different statuses involved in a Data Product pipeline execution. | Data extraction |
| LoadProcess | Checks the status and reports possible errors during the process. | Data loading checking |
| LoadProcessTask | Gives information about the LoadProcess tasks during a data ingestion and their destination. | Data loading checking |
| LoadStatus | Details each status, adding a "Description" field. | Data loading checking |
| LoadStatusEnum | Contains the different loading statuses and their definitions. | Data loading checking |
| Pipeline | More information can be checked here in Sidra Service. For Sidra Data Products, this table contains an ExecutionParameters field; when it is filled, the pipeline is executed by the Sync job. | Data Factory |
| PipelineProvider | Relation between Pipeline and Provider. | Metadata |
| PipelineSyncBehaviours | For more information, please check this section. | Sync |
| PipelineTemplate | More information can be checked here in Sidra Service. | Data Factory |
| Provider | More information can be checked here in Sidra Service. | Metadata |
| StagingConfiguration | Contains the configuration of the new table names that need to be created in Staging and the queries to create these new tables. | Configuration |
| TableImport | Indicates whether the data transfer between the staging table and the Sidra Data Product completed correctly; if not, it raises an error. | Data loading checking |
| Trigger | More information can be checked here in Sidra Service. | Data Factory |
| TriggerPipeline | More information can be checked here in Sidra Service. | Data Factory |
| TriggerTemplate | More information can be checked here in Sidra Service. | Data Factory |
| PipelineTemplateType | More information can be checked here in Sidra Service. | Data Factory |
| PipelineTemplateTypes | Describes the kinds of pipelines used. | Data Factory |
| TabularMetadata | Stores the names of tables, partition columns and views within the tabular model. | Tabular model |
| TabularPartition | Stores information about table partitions. | Tabular model |
| Measure | Stores information related to the measures used in the tabular model. | Tabular |
| RoleMember | Stores the roles by user. | Tabular |

AssetTimeTravel table

This table is populated by Sidra with the Time Travel information provided by Client Databricks when the Asset data is inserted in Client Databricks.

Please check the Databricks documentation for more information on the Time Travel functionality.

| Column | Description |
|---|---|
| Id | Id of the Asset to which the Time Travel definition applies. |
| IdStagingConfiguration | Identifier of the staging configuration used to load this Asset. |
| VersionNumber | Version number of the Time Travel given by Databricks. |
| Timestamp | Timestamp of the Time Travel given by Databricks. |

Step-by-step

How to handle dependent Entities through the Databricks app

For a detailed example of how to use these tables and the general functioning of the PI Data Product, please check this tutorial.

EntityPipeline table

The same extraction pipeline can be used to extract several Assets from different Entities. That can be configured by associating the same pipeline to several Entities using the EntityPipeline table.

| Column | Description | Required |
|---|---|---|
| IdPipeline | Identifier of the pipeline | Yes |
| IdEntity | Identifier of the related Entity | Yes |
| IsMandatory | If true, the Assets associated with the IdEntity must form a complete set to run the pipeline. If false, the complete-data calculation will not take this IdEntity into account. By default: true. | Yes |
| PipelineExecutionProperties | A place to put a JSON that can be used in the pipeline for different purposes. By default: NULL, empty or "{}". | No |
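As an illustration of how IsMandatory affects execution, the sketch below uses a minimal, hypothetical in-memory stand-in for the EntityPipeline and Asset tables (not Sidra code) to check whether a pipeline has a complete set of mandatory Assets:

```python
# Hypothetical stand-in for EntityPipeline rows: (IdPipeline, IdEntity, IsMandatory).
entity_pipeline = [
    (1, 10, True),   # Entity 10 is mandatory for pipeline 1
    (1, 11, True),   # Entity 11 is mandatory for pipeline 1
    (1, 12, False),  # Entity 12 is optional for pipeline 1
]

def pipeline_can_run(id_pipeline, assets_by_entity):
    """Return True when every mandatory Entity of the pipeline has at least
    one Asset available; optional Entities do not block the run."""
    mandatory = [e for (p, e, m) in entity_pipeline if p == id_pipeline and m]
    return all(assets_by_entity.get(e) for e in mandatory)

print(pipeline_can_run(1, {10: ["asset-a"], 11: ["asset-b"]}))  # True (Entity 12 is optional)
print(pipeline_can_run(1, {10: ["asset-a"]}))                   # False (Entity 11 has no Asset)
```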

PipelineExecutionProperties column

The PipelineExecutionProperties column in the EntityPipeline table of Data Products is used to add a ConsolidationMode parameter (for example, Overwrite) that takes effect before any data load. A Databricks notebook checks the parameter and behaves as follows:

  • When ConsolidationMode is set to Overwrite, the entire table is overwritten.
  • When ConsolidationMode is set to Merge (default), the data is merged if there is a Primary Key; otherwise, the data is appended.
  • When ConsolidationMode is set to Append, the data is appended.
  • When any other value is set, an error is thrown stating that the ConsolidationMode XXX is not valid.
  • When the parameter does not exist or is empty, the default value of Merge is used.

Do not confuse this with the consolidation mode options on the DSU.
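The rules above can be sketched as follows; this is an illustrative Python reimplementation of the parameter handling, not the actual Databricks notebook code:

```python
import json

VALID_MODES = ("Overwrite", "Merge", "Append")

def resolve_consolidation_mode(pipeline_execution_properties):
    """Resolve ConsolidationMode from the PipelineExecutionProperties JSON:
    missing, NULL or empty values fall back to Merge; unknown values raise."""
    props = json.loads(pipeline_execution_properties or "{}")
    mode = props.get("ConsolidationMode") or "Merge"
    if mode not in VALID_MODES:
        raise ValueError(f"ConsolidationMode {mode} is not valid")
    return mode

print(resolve_consolidation_mode('{"ConsolidationMode": "Overwrite"}'))  # Overwrite
print(resolve_consolidation_mode(None))                                  # Merge
```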

ExtractPipelineExecution table

The ExtractPipelineExecution table is an execution tracking table that is only present in a Data Product.

| Column | Description |
|---|---|
| Id | Id of the record |
| IdPipeline | Id of the Data Product pipeline |
| PipelineRunId | GUID with the Id of the pipeline execution in ADF |
| PipelineExecutionDate | Timestamp of the execution of the pipeline |
| IdEntity | Id of the Entity |
| IdAsset | Id of the Asset |
| AssetDate | Business date of the Asset being ingested |
| AssetOrder | Order number used internally by the stored procedures when loading the data |
| IdExtractPipelineExecutionStatus | Id of the ExtractPipelineExecutionStatus for the status of the Asset during the pipeline load |

If the Sync Behavior (see the Sync Behavior documentation) is LoadUpToLastValidDate, the Asset load status per pipeline will be checked from this table.

ExtractPipelineExecutionStatus table

The ExtractPipelineExecutionStatus table is a reference table to enumerate the different statuses involved along a Data Product pipeline execution.

| Column | Description |
|---|---|
| Id | Id of the record |
| Name | Name of the execution status |
| Description | Description of the execution status |

ExtractPipelineExecutionStatuses table

| Status Id | Status Name | Status Description |
|---|---|---|
| 0 | Extracting | Asset is being copied from the DSU into the Client Staging area |
| 1 | Staging | Asset already copied into the Staging area |
| 2 | Finished | Asset already copied into the final tables |
| 3 | Error | Issue when copying the Asset |
| 4 | Removed | Asset deleted |

The statuses for this table are updated by some Data Product pipeline activities.

Load tables

Group of tables designated to check the loads and their correct operation during the process.

LoadProcess table

This table tracks and logs the details of data loading processes.

| Column | Description |
|---|---|
| Id | LoadProcess identifier |
| LoadStartDate | Start date of the LoadProcess |
| LoadEndDate | End date of the LoadProcess |
| IdLoadStatus | Identifier of the LoadStatus, detailed in the LoadStatusEnum table |
| ErrorMessage | Error message shown in case some error occurs during the LoadProcess |
| PipelineRunId | Run Id of the pipeline execution |
| IdPipeline | Identifier of the pipeline used |

LoadProcessTask table

This table stores information about monitoring of individual tasks within a data loading process.

| Column | Description |
|---|---|
| Id | Identifier of the task |
| IdLoadProcess | LoadProcess identifier |
| DestinationTableName | Name of the table in which the data will be inserted or updated |
| StartDate | Start date of the insertion |
| EndDate | End date of the insertion |
| Error | Boolean indicating whether an error occurred |
| ErrorMessage | Error message shown in case some error occurs during the LoadProcessTask |
| RowsInserted | Number of rows inserted |
| RowsUpdated | Number of rows updated |

LoadStatus table

| Column | Description |
|---|---|
| Id | Identifier of the status, detailed in the LoadStatusEnum table |
| StatusName | Name of the status |
| Description | Details of the status |

LoadStatusEnum table

| Identifier | Status |
|---|---|
| 0 | Error |
| 1 | Loading |
| 2 | Loaded |
| 3 | NoFiles |

Pipeline table

The Pipeline table in Data Products has some additional fields compared to the Pipeline table in Sidra Service:

  • ExecutionParameters: An additional field used to provide values for the parameters of the Data Product pipeline when the pipeline is executed by the Sync job.

PipelineProvider table

An association between a Pipeline and a Provider can be added using the PipelineProvider table. This association is used by the Sync job to execute only those pipelines associated with enabled Providers, i.e. those Providers with false in the IsDisabled field.

| Column | Description |
|---|---|
| IdPipeline | Pipeline identifier |
| IdProvider | Provider identifier |

PipelineSyncBehaviour table

PipelineSyncBehaviour is a table that defines the different types of behaviour executed from the Sync job's perspective. The Sync job is responsible for synchronizing metadata and executing the pipelines associated with the Entities.

| Identifier | Status | Description |
|---|---|---|
| 0 | Ignored | The pipeline is ignored by the Sync job, so it should probably be executed explicitly using a Trigger |
| 1 | LoadAllNewDates | The Sync job provides all new dates with a complete set of data not yet imported (with status 2 or 6) when the pipeline is executed |
| 2 | LoadFirstNewDate | The Sync job provides the oldest date with a complete set of data not yet imported (with status 2 or 6) when the pipeline is executed |
| 3 | LoadLastNewDate | The Sync job provides the most recent date with a complete set of data not yet imported (with status 2 or 6) when the pipeline is executed |
| 4 | LoadPendingUpToValidDate | The Sync job provides all Assets not previously loaded, from dates since the last execution up to the last day with all mandatory Assets available |
| 5 | LoadUpToLastValidDate | Default. The Sync job provides all Assets not previously loaded by this same pipeline. Assets that failed to load are retried a limited number of times and then discarded |

The LoadUpToLastValidDate Sync behaviour works similarly to LoadPendingUpToValidDate, but it checks the ExtractPipelineExecution table so that it can work with multiple pipelines. It is the mandatory Sync behaviour for loading an Entity with several pipelines, and the recommended Sync behaviour for all new load pipelines. This Sync behaviour retries unfinished Assets per pipeline a number of times (3 by default), which can be changed through the App Service configuration by setting the Sync.AssetRetries key.
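The retry rule can be sketched as below, using hypothetical in-memory rows instead of the real ExtractPipelineExecution table (status codes 2 and 3 are Finished and Error, per the ExtractPipelineExecutionStatuses table):

```python
FINISHED, ERROR = 2, 3  # ExtractPipelineExecutionStatus ids

def assets_to_load(history, asset_retries=3):
    """Return the Assets the Sync job would hand to the pipeline: Assets that
    never reached Finished and have failed fewer than `asset_retries` times
    (the default mirrors the Sync.AssetRetries setting)."""
    state = {}
    for id_asset, id_status in history:
        entry = state.setdefault(id_asset, {"errors": 0, "finished": False})
        if id_status == FINISHED:
            entry["finished"] = True
        elif id_status == ERROR:
            entry["errors"] += 1
    return [a for a, s in state.items()
            if not s["finished"] and s["errors"] < asset_retries]

history = [(1, ERROR), (1, ERROR), (1, ERROR),  # Asset 1 exhausted its retries
           (2, FINISHED),                       # Asset 2 already loaded
           (3, ERROR)]                          # Asset 3 still retryable
print(assets_to_load(history))  # [3]
```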

The behaviors LoadPendingUpToValidDate or LoadUpToLastValidDate are required in order to avoid situations like the following:

  • If there is an error in the Data Intake Process (Sidra Service ingestion, incremental loads), some of the Entities may be loaded while the failed Entities are not. The overall status of the job is still Succeeded, so the iterative batch process in the pipeline that loads all Entities is not blocked. When this issue happens, e.g. with a set of three Entities marked as Mandatory on the Data Product side (EntityPipeline table), the Sync job will not trigger the pipeline until all three Assets have been generated, which could take several days (e.g. three days, assuming that on day 3 the three mandatory Assets are finally available).

Sync behaviour LoadAllNewDates only loads the Assets of complete dates as of the execution date of the Sync job, leaving the already loaded increments in Sidra Service out of the Data Product staging tables. Some Data Product business logic may need to know exactly on which days there were Assets that were not processed, so that the Data Product can retrieve this missing data.

Sync behaviours LoadPendingUpToValidDate and LoadUpToLastValidDate load any Asset up to the most recent day on which Assets are available for every mandatory Entity.

??? Info "Update the Sync behaviour of pipelines"
    Update the Sync behaviour from the endpoint:

        PUT /api/pipeline

    Check here the available endpoints.
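A sketch of a request body for that endpoint, switching a pipeline to LoadUpToLastValidDate (identifier 5 in the PipelineSyncBehaviour table). The payload field names below are illustrative assumptions and should be checked against the Sidra API documentation:

```python
import json

payload = {
    "id": 42,                      # hypothetical pipeline Id
    "idPipelineSyncBehaviour": 5,  # 5 = LoadUpToLastValidDate (assumed field name)
}
body = json.dumps(payload)  # body to send with PUT /api/pipeline
print(body)
```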

StagingConfiguration table

Data Products contain the StagingConfiguration table, which holds the configuration of the new table names that need to be created in Staging, as well as the queries to create these new tables.

| Column | Description |
|---|---|
| IdPipeline | Id of the pipeline that extracts the data from the DSU to the Staging tables (mandatory). |
| Query | Query that defines how to retrieve the data from the Databricks of the Data Product. If it is not defined, an IdEntity must be specified so the content can be retrieved with a default query. IdEntity and Query cannot both be NULL at the same time. |
| IdEntity | Entity identifier used to retrieve the data from the Databricks of the Data Product. By default, all Attributes of the Entity are retrieved. If it is not defined, a Query must be specified to define how to retrieve the data. IdEntity and Query cannot both be NULL at the same time. |
| SchemaName | Schema name used for the staging tables. By default, the Staging schema is used. |
| TableName | Table name for the SQL staging table. |
| TableFieldDefinition | Field definition for the staging table, e.g., [SalesOrderID] [int] NULL, [OrderDate] [datetime] NULL, [SalesOrderNumber] [nvarchar](25) NULL, [UnitPrice] [money] NULL. If it is not defined, the whole list of Attributes of the Entity is used. TableFieldDefinition and IdEntity cannot both be NULL at the same time. |

The StagingConfiguration table is needed in order to parametrize these queries in the Databricks notebook.
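As an illustration of the default behaviour when only IdEntity is set, the sketch below builds a query over all Attributes of an Entity; the attribute names and the exact query shape are assumptions for illustration, not Sidra's actual implementation:

```python
# Hypothetical Attribute names for an Entity; in Sidra these would come
# from the Attribute table.
attributes = ["SalesOrderID", "OrderDate", "SalesOrderNumber", "UnitPrice"]

def default_staging_query(table_name, attribute_names):
    """Build a SELECT over every Attribute of the Entity, i.e. the kind of
    default query used when StagingConfiguration.Query is NULL."""
    columns = ", ".join(f"`{name}`" for name in attribute_names)
    return f"SELECT {columns} FROM {table_name}"

print(default_staging_query("salesorderdetail", attributes))
# SELECT `SalesOrderID`, `OrderDate`, `SalesOrderNumber`, `UnitPrice` FROM salesorderdetail
```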

TableImport table

This table stores the names of the tables imported from the Data Storage Unit.

| Column | Description |
|---|---|
| Id | Primary key |
| SourceTable | Name of the source table |
| DestinationSchema | Name of the destination schema |
| DestinationTable | Name of the destination table |
| IdAsset | Id of the imported Asset |
| LastUpdated | Timestamp of the last time the Asset was imported |
| ErrorMessage | Error message in case an error occurred on import |

PipelineTemplateType table

This table stores the template definition types for Azure Data Factory pipelines.

| Column | Description |
|---|---|
| Id | Primary key |
| Name | Name of the pipeline template type |
| Description | Description of the pipeline template type |

PipelineTemplateTypes table

This table defines all the pipeline template types.

| Identifier | Type | Description |
|---|---|---|
| 0 | System | Pipelines used internally |
| 1 | Data Product Intake | Pipelines used for data load processes |
| 2 | Other | Other kinds of pipelines |

TabularMetadata table

This table stores the metadata related to the tabular model, defining the structure and relationships of the data within it.

| Column | Description |
|---|---|
| Id | Primary key |
| DWTableName | Name of the table in the data warehouse |
| DWPartitionColumn | Partition column stored in the data warehouse |
| DWViewName | View name |
| TabularModelTableName | Table name from the tabular model |

TabularPartition table

This table provides details about the partitions in the tabular model, as well as metadata like partitioning limits or update dates.

| Column | Description | Type |
|---|---|---|
| Id | Primary key | int |
| FactTable | Name of the fact table | string |
| LeftBoundary | Left limit of the partition range | int |
| RightBoundary | Right limit of the partition range | int |
| PartitionNumber | Partition number | int |
| LastUpdated | Last update date | datetime |
| TabularProcessed | Defines whether it has been processed | boolean |
| CreationDate | Creation date | datetime |
| TabularTableName | Name of the tabular table | string |
| PartitionName | Name of the partition | string |

Measure table

This table contains information about measures used in the tabular model.

| Column | Description | Type |
|---|---|---|
| Id | Primary key | int |
| Name | Measure name | string |
| Description | Measure description | string |
| TableName | Name of the table | string |
| Expression | Measure expression in DAX | string |
| DisplayFolder | Folder that contains the measure | string |
| FormatString | Format of the measure results | string |
| Priority | Priority identifier | int |
| IdProvider | Provider identifier | int |
| Removed | Defines whether it is removed | boolean |

RoleMember table

| Column | Description | Type |
|---|---|---|
| Id | Primary key | int |
| Role | Role name | string |
| User | User name | string |