Data Product tables¶
Many of the tables in the Sidra Data Product are similar in structure and content to those stored in Sidra Service. Please see the section on Sidra Service metadata for details. Even so, there are slight differences in certain tables, as well as new tables aimed at extracting the maximum functionality for customer purposes in Data Products. Below is a summary of the existing tables for Data Products.
Data Products Tables | Description | Type |
---|---|---|
ActivityTemplate | More information can be checked here in Sidra Service. | Data Factory |
ActivityTemplatePipelineTemplate | | Data Factory |
Asset | More information can be checked here in Sidra Service. | Metadata |
AssetStatus | More information can be checked here in Sidra Service. | Metadata |
AssetTimeTravel | This table is populated by Sidra with the Time Travel information provided by Client Databricks when the data of the Assets is inserted in Client Databricks. | Performance |
Attribute | More information can be checked here in Sidra Service. | Metadata |
DataFactory | More information can be checked here in Sidra Service. | Data Factory |
DataSetTemplate | More information can be checked here in Sidra Service. | Data Factory |
DataSetTemplatePipelineTemplate | More information can be checked here in Sidra Service. | Data Factory |
DataSource | More information can be checked here in Sidra Service. | Metadata |
DataSourceTemplate | More information can be checked here in Sidra Service. | Metadata |
DataSourceTemplatePipelineTemplate | More information can be checked here in Sidra Service. | Metadata |
Entity | More information can be checked here in Sidra Service. | Metadata |
EntityPipeline | This table allows associating the same pipeline with several Entities. | Metadata |
ExtractPipelineExecution | Tracks the extraction of data from Sidra Service, recording the information used by the stored procedures that load it. | Data extraction |
ExtractPipelineExecutionStatus | Reference table that enumerates the different statuses involved along a Data Product pipeline execution. | Data extraction |
LoadProcess | Checks the status and reports possible errors during the process. | Data loading checking |
LoadProcessTask | Gives information about the LoadProcess during a new data ingestion and its destination. | Data loading checking |
LoadStatus | Details the status, adding a "Description" field. | Data loading checking |
LoadStatusEnum | Contains the different statuses of the loading and their definitions. | Data loading checking |
Pipeline | More information can be checked here in Sidra Service. For Sidra Data Products, this table contains an ExecutionParameters field that, when filled, causes the pipeline to be executed by the Sync job. | Data Factory |
PipelineProvider | Relation between Pipeline and Provider. | Metadata |
PipelineSyncBehaviours | For more information, please check this section. | Sync |
PipelineTemplate | More information can be checked here in Sidra Service. | Data Factory |
Provider | More information can be checked here in Sidra Service. | Metadata |
StagingConfiguration | Contains the configuration of the new table names in Staging that need to be created and the queries to create these new tables. | Configuration |
TableImport | Gives information about whether the data transaction between the staging table and the Sidra Data Product has been properly done; if not, it raises an error. | Data loading checking |
Trigger | More information can be checked here in Sidra Service. | Data Factory |
TriggerPipeline | More information can be checked here in Sidra Service. | Data Factory |
TriggerTemplate | More information can be checked here in Sidra Service. | Data Factory |
PipelineTemplateType | More information can be checked here in Sidra Service. | Data Factory |
PipelineTemplateTypes | Describes the kinds of pipelines used. | Data Factory |
TabularMetadata | Stores the names of tables, partition columns and views within the tabular model. | Tabular model |
TabularPartition | Stores information about the table partitions. | Tabular model |
Measure | Stores information related to the measures used in the tabular model. | Tabular model |
RoleMember | Stores the roles by user. | Tabular model |
AssetTimeTravel table¶
This table is populated by Sidra with the information of Time Travel provided by Client Databricks when the data of the Assets is inserted in Client Databricks.
Please check the Databricks documentation for more information on the Time Travel functionality.
Column | Description |
---|---|
Id | Id of the Asset where Time Travel definition applies. |
IdStagingConfiguration | Identifier of the staging configuration used to load this Asset. |
VersionNumber | Version number of the Time Travel given by Databricks. |
Timestamp | Timestamp of the Time Travel given by Databricks. |
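The recorded version and timestamp can be used to query the Delta table exactly as it was when the Asset was inserted. The following Python sketch builds such a time-travel query; the table name and the helper itself are illustrative, not part of the Sidra API:

```python
from typing import Optional

def time_travel_query(table_name: str,
                      version_number: Optional[int] = None,
                      timestamp: Optional[str] = None) -> str:
    """Build a Databricks Delta time-travel SELECT from an AssetTimeTravel row."""
    if version_number is not None:
        # VersionNumber column: read the table as of that Delta version.
        return f"SELECT * FROM {table_name} VERSION AS OF {version_number}"
    if timestamp is not None:
        # Timestamp column: read the table as of that moment.
        return f"SELECT * FROM {table_name} TIMESTAMP AS OF '{timestamp}'"
    raise ValueError("Either version_number or timestamp is required")

print(time_travel_query("staging.sales", version_number=12))
```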
Step-by-step

How to handle dependent Entities through the Databricks app

For a detailed example on how to use these tables and the general functioning of the PI Data Product, please check this tutorial.
EntityPipeline table¶
The same extraction pipeline can be used to extract several Assets from different Entities. That can be configured by associating the same pipeline to several Entities using the `EntityPipeline` table.
Column | Description | Required |
---|---|---|
IdPipeline | Identifier of the pipeline | Yes |
IdEntity | Identifier of the related Entity | Yes |
IsMandatory | If `true`, Assets associated with the IdEntity must be present as a complete set to run the pipeline. If `false`, the complete-data calculation will not take this IdEntity into account. By default: `true`. | Yes |
PipelineExecutionProperties | A JSON value that can be used in the pipeline for different purposes. By default: NULL, empty or `"{}"`. | No |
PipelineExecutionProperties column¶
The `PipelineExecutionProperties` column in the `EntityPipeline` table of Data Products is used to add a `ConsolidationMode` parameter, which will take effect before any data load. A Databricks notebook will be in charge of the following behavior, depending on the value of the parameter:

- When `ConsolidationMode` is set to `Overwrite`, the entire table will be overwritten.
- When `ConsolidationMode` is set to `Merge` (default), the data will be merged if there is a Primary Key; otherwise the data will be appended.
- When `ConsolidationMode` is set to `Append`, the data is appended.
- When any other value is set in this parameter, it will throw an error stating that the `ConsolidationMode XXX` is not valid.
- When the parameter does not exist or is empty, it will take the default value of `Merge`.

Do not confuse this with the consolidation mode options on the DSU.
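The rules above can be sketched as a small resolver. This is a minimal illustration, assuming the value is stored as a JSON object with a `ConsolidationMode` key; it is not Sidra's actual notebook code:

```python
import json

VALID_MODES = {"Overwrite", "Merge", "Append"}

def resolve_consolidation_mode(pipeline_execution_properties):
    """Apply the rules above: a missing or empty parameter falls back to
    Merge; unknown values raise an error."""
    if not pipeline_execution_properties:
        return "Merge"
    props = json.loads(pipeline_execution_properties)
    mode = props.get("ConsolidationMode")
    if mode is None:
        return "Merge"
    if mode not in VALID_MODES:
        raise ValueError(f"ConsolidationMode {mode} is not valid")
    return mode

print(resolve_consolidation_mode('{"ConsolidationMode": "Overwrite"}'))  # Overwrite
print(resolve_consolidation_mode("{}"))                                  # Merge
```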
ExtractPipelineExecution table¶
The `ExtractPipelineExecution` table is an execution tracking table that is only present in a Data Product.
Column | Description |
---|---|
Id | Id of the record |
IdPipeline | Id of the Data Product pipeline |
PipelineRunId | GUID with the Id of the pipeline execution in ADF |
PipelineExecutionDate | Timestamp of execution of the pipeline |
IdEntity | Id of the Entity |
IdAsset | Id of the Asset |
AssetDate | Business date of the Asset being ingested |
AssetOrder | An order number internally used in the stored procedures for loading the data |
IdExtractPipelineExecutionStatus | Id of the ExtractPipelineExecutionStatus for the status of the Asset during the pipeline load |
If the Sync Behavior (see the Sync Behavior documentation) is `LoadUpToLastValidDate`, the Asset load status per pipeline will be checked from this table.
ExtractPipelineExecutionStatus table¶
The `ExtractPipelineExecutionStatus` table is a reference table that enumerates the different statuses involved along a Data Product pipeline execution.
Column | Description |
---|---|
Id | Id of the record |
Name | Name of the execution status |
Description | Description of the execution status |
ExtractPipelineExecutionStatuses table¶
Status Id | Status Name | Status Description |
---|---|---|
0 | Extracting | Asset is being copied from the DSU into the Client Staging area |
1 | Staging | Asset already copied into the Staging area |
2 | Finished | Asset already copied into the final tables |
3 | Error | Issue when copying the Asset |
4 | Removed | Asset deleted |
The statuses for this table are updated by some Data Product pipeline activities.
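Code that consumes these statuses can keep the numeric ids readable with a small enumeration. A sketch, with ids and names taken from the table above:

```python
from enum import IntEnum

class ExtractPipelineExecutionStatus(IntEnum):
    """Mirror of the ExtractPipelineExecutionStatus reference table."""
    EXTRACTING = 0  # being copied from the DSU into the Client Staging area
    STAGING = 1     # already copied into the Staging area
    FINISHED = 2    # already copied into the final tables
    ERROR = 3       # issue when copying the Asset
    REMOVED = 4     # Asset deleted

print(ExtractPipelineExecutionStatus(2).name)  # FINISHED
```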
Load tables¶
Group of tables used to check the loads and their proper operation during the process.
LoadProcess table¶
This table tracks and logs the details of data loading processes.
Column | Description |
---|---|
Id | LoadProcess identifier |
LoadStartDate | Date of start of the LoadProcess |
LoadEndDate | Date of end of the LoadProcess |
IdLoadStatus | Identifier of the LoadStatus, detailed in the LoadStatusEnum table |
ErrorMessage | Error message shown in the case that some error exists during the LoadProcess |
PipelineRunId | GUID of the pipeline run in ADF |
IdPipeline | Pipeline identifier used |
LoadProcessTask table¶
This table stores monitoring information about the individual tasks within a data loading process.
Column | Description |
---|---|
Id | Identifier of the task |
IdLoadProcess | LoadProcess identifier |
DestinationTableName | Name of the table in which the data will be inserted or updated |
StartDate | Date of start for the insertion |
EndDate | Date of end for the insertion |
Error | Boolean that indicates whether an error occurred |
ErrorMessage | Error message shown in the case that some error exists during the LoadProcessTask |
RowsInserted | Number of rows inserted |
RowsUpdated | Number of rows updated |
LoadStatus table¶
Column | Description |
---|---|
LoadStatusEnum Id | Identifier of the status detailed in the LoadStatusEnum table |
StatusName | Name of the status |
Description | Details of the status |
LoadStatusEnum table¶
Identifier | Status |
---|---|
0 | Error |
1 | Loading |
2 | Loaded |
3 | NoFiles |
Pipeline table¶
The `Pipeline` table in Data Products has some additional fields compared to the fields that the `Pipeline` table has in Sidra Service:

- `ExecutionParameters`: an additional field used for including values for the parameters of the Data Product pipeline when the pipeline is executed by the Sync job.
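A sketch of how such stored parameters could be combined with runtime values when the Sync job triggers the pipeline. The parameter names and the merging logic here are assumptions for illustration, not the documented Sidra behavior:

```python
import json

def build_pipeline_parameters(execution_parameters, asset_date):
    """Merge the ExecutionParameters JSON stored in the Pipeline table with a
    runtime value supplied at execution time (hypothetical 'assetDate')."""
    params = json.loads(execution_parameters) if execution_parameters else {}
    params["assetDate"] = asset_date  # illustrative runtime parameter
    return params

print(build_pipeline_parameters('{"destinationSchema": "Staging"}', "2023-05-01"))
```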
PipelineProvider table¶
An association between the Pipeline and the Provider can be added using the `PipelineProvider` table. This association is used by the Sync job to execute only those pipelines that are associated with enabled Providers, i.e. those Providers with `false` in the `IsDisabled` field.
Column | Description |
---|---|
IdPipeline | Pipeline identifier |
IdProvider | Provider identifier |
PipelineSyncBehaviour table¶
`PipelineSyncBehaviour` is a table that defines the different types of behaviour to be executed from the `Sync` job perspective. The `Sync` job is responsible for synchronizing metadata and executing the pipelines associated with the Entities.
Identifier | Status | Description |
---|---|---|
0 | Ignored | Pipeline will be ignored by the Sync job, so probably that pipeline should be executed explicitly using a Trigger |
1 | LoadAllNewDates | The Sync job will provide all new dates for complete set of data not yet imported (with status 2 or 6) when the pipeline is executed |
2 | LoadFirstNewDate | The Sync job will provide the older date for the complete set of data not yet imported (with status 2 or 6) when the pipeline is executed |
3 | LoadLastNewDate | The Sync job will provide the most recent date for the complete set of data not yet imported (with status 2 or 6) when the pipeline is executed |
4 | LoadPendingUpToValidDate | The Sync job will provide all Assets not previously loaded from dates since last execution up to last day with all mandatory Assets available |
5 | LoadUpToLastValidDate | Default. The Sync job will provide all Assets not previously loaded by this same pipeline. Assets that failed to load will be included for retrying a limited number of times, and then discarded |
The `LoadUpToLastValidDate` Sync behavior works similarly to `LoadPendingUpToValidDate`, but checks the `ExtractPipelineExecution` table in order to work with multiple pipelines. This is the mandatory Sync behavior for loading an Entity with several pipelines, and the recommended Sync behavior for all new load pipelines. This Sync behaviour will retry not-finished Assets per pipeline a number of times (3 by default); this can be changed through the App Service configuration by setting the `Sync.AssetRetries` key.
The behaviors `LoadPendingUpToValidDate` or `LoadUpToLastValidDate` are required in order to avoid situations like the following:

- If there is an error in the Data Intake Process (Sidra Service ingestion, incremental loads), some of the Entities are loaded but the failed Entities are not. The overall status of the job is `Succeeded`, so that the iterative batch process in the pipeline to load all Entities is not blocked. When this issue happens, e.g. with a set of three Entities marked as mandatory on the Data Product side (table `EntityPipeline`), the Sync job will not trigger the pipeline until all three Assets have been generated, which could take several days (e.g. three days, assuming that on day 3 we finally have the three mandatory Assets).
The `LoadAllNewDates` Sync behavior is only capable of loading the Assets for all completed dates at the date of execution of the Sync job, leaving increments already loaded in Sidra Service without loading them into the Data Product staging tables. For some Data Product business logic, we may need to know exactly on which days there were Assets that were not processed, so that the Data Product can retrieve this missing data.
The `LoadPendingUpToValidDate` or `LoadUpToLastValidDate` Sync behaviors load any pending Asset up to the most recent day for which Assets are available for every mandatory Entity.
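The retry rule of `LoadUpToLastValidDate` can be sketched as follows. The record shape and field names are assumptions for illustration; the real check runs against the `ExtractPipelineExecution` table:

```python
DEFAULT_ASSET_RETRIES = 3  # default value of the Sync.AssetRetries setting

def assets_to_retry(executions, max_retries=DEFAULT_ASSET_RETRIES):
    """Given past execution records ({'id_asset': ..., 'status': ...}),
    return failed Assets still within their retry budget; Assets that
    already failed max_retries times are discarded."""
    finished = {e["id_asset"] for e in executions if e["status"] == "Finished"}
    failures = {}
    for e in executions:
        if e["status"] == "Error":
            failures[e["id_asset"]] = failures.get(e["id_asset"], 0) + 1
    return sorted(a for a, n in failures.items()
                  if a not in finished and n < max_retries)

history = [
    {"id_asset": 1, "status": "Error"},
    {"id_asset": 1, "status": "Error"},
    {"id_asset": 2, "status": "Finished"},
]
print(assets_to_retry(history))  # [1]: still within its retry budget
```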
??? Info "Update the Sync behaviour of pipelines"
    Update the Sync behaviour from the endpoint. Check here the available endpoints.

StagingConfiguration table¶
Data Products contain the `StagingConfiguration` table, which contains the configuration of the new table names in Staging that need to be created, as well as the queries to create these new tables.
Column | Description |
---|---|
IdPipeline | Id of the pipeline for the extraction of the data from the DSU to the Staging tables (Mandatory). |
Query | Query that defines the way to retrieve the data from the Databricks of the Data Product. If it is not defined, an IdEntity must be specified to retrieve the content by a default query. IdEntity and Query cannot be defined both as NULL at the same time. |
IdEntity | Entity identifier to retrieve the data from the Databricks of the Data Product. By default, it retrieves all Attributes of the Entity. If it is not defined, a Query must be specified to know the way to retrieve the data. IdEntity and Query cannot be both NULL at the same time. |
SchemaName | Schema name used for the staging tables. By default, Staging schema is used. |
TableName | Table name for SQL staging table. |
TableFieldDefinition | Table field definition for the staging table, e.g., `[SalesOrderID] [int] NULL,[OrderDate] [datetime] NULL,[SalesOrderNumber] [nvarchar](25) NULL,[UnitPrice] [money] NULL`. If it is not defined, the whole list of Attributes of the Entity is used. TableFieldDefinition and IdEntity cannot both be NULL at the same time. |
The `StagingConfiguration` table is needed in order to parametrize these queries in the Databricks notebook.
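The Query / IdEntity rule above can be sketched as a small resolver. The default-query shape and the Entity-to-table mapping are assumptions, not the actual notebook logic:

```python
def resolve_staging_query(query, id_entity, entity_table_names):
    """Pick the extraction query for a StagingConfiguration row: use Query when
    defined, otherwise derive a default query from the Entity; both NULL is an
    error, as stated in the table above."""
    if query:
        return query
    if id_entity is None:
        raise ValueError("IdEntity and Query cannot both be NULL at the same time")
    # Hypothetical default: select every Attribute of the Entity.
    return f"SELECT * FROM {entity_table_names[id_entity]}"

tables = {42: "provider.sales_orders"}  # illustrative Entity-to-table mapping
print(resolve_staging_query(None, 42, tables))  # SELECT * FROM provider.sales_orders
```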
TableImport table¶
This table stores the name of tables imported from the Data Storage Unit.
Column | Description |
---|---|
Id | Primary key |
SourceTable | Name of the source table |
DestinationSchema | Name of the destination schema |
DestinationTable | Name of the destination table |
IdAsset | Id of the imported Asset |
LastUpdated | Timestamp of the last time the asset was imported |
ErrorMessage | Error message in case an error occurred on import |
PipelineTemplateType table¶
This table stores the template definition types for Azure Data Factory pipelines.
Column | Description |
---|---|
Id | Primary key |
Name | Name of the pipeline template type |
Description | Description of the pipeline template type |
PipelineTemplateTypes table¶
This table defines all the pipeline template types.
Identifier | Type | Description |
---|---|---|
0 | System | Pipelines used internally |
1 | Data Product Intake | Pipelines used for data load processes |
2 | Other | Other kind of pipelines |
TabularMetadata table¶
This table stores the metadata related to the tabular model, defining the structure and relationships of the data within it.
Column | Description |
---|---|
Id | Primary key |
DWTableName | Name of the table in the data warehouse |
DWPartitionColumn | Partitioned column stored in the data warehouse |
DWViewName | View name |
TabularModelTableName | Table name from the tabular model |
TabularPartition table¶
This table provides details about the partitions in the tabular model, as well as metadata like partitioning limits or update dates.
Column | Description | Type |
---|---|---|
Id | Primary key | int |
FactTable | FactTable name | string |
LeftBoundary | Left limit of the partition range | int |
RightBoundary | Right limit of the partition range | int |
PartitionNumber | Partitions number | int |
LastUpdated | Last date of update | datetime |
TabularProcessed | Defines if it is processed or not | boolean |
CreationDate | Date of creation | datetime |
TabularTableName | Name of the tabular table | string |
PartitionName | Name of the partition | string |
Measure table¶
This table contains information about measures used in the tabular model.
Column | Description | Type |
---|---|---|
Id | Primary key | int |
Name | Measure name | string |
Description | Measure description | string |
TableName | Name of the table | string |
Expression | Measure expression in DAX | string |
DisplayFolder | Folder that contains the measure | string |
FormatString | Measure results format | string |
Priority | Identifier of priority | int |
IdProvider | Provider identifier | int |
Removed | Defines if it is removed or not | boolean |
RoleMember table¶
Column | Description | Type |
---|---|---|
Id | Primary key | int |
Role | Role name | string |
User | User name | string |