GenerateDeltaTable Configuration¶

≥ 2022.R3

Sidra includes support for the Change Data Feed feature in Azure Databricks, enabling efficient tracking of row-level changes in Delta tables. You can enable this feature through the GenerateDeltaTable property in the Entity metadata table.

The following table illustrates how Databricks records different change operations:

Change Type	Result in Delta Table
Fields	Original columns + `_change_type`, `_commit_version`, `_commit_timestamp`
Insertions	New row added, `_change_type = insert`
Updates	Two rows added per update: original data (`_change_type = update_preimage`) and updated data (`_change_type = update_postimage`), both sharing `_commit_version` and `_commit_timestamp`
Deletions	Row added with deleted values and `_change_type = delete`

Configuration¶

Considerations¶

To modify the GenerateDeltaTable property, initial metadata must exist for the Entity. After the first metadata ingestion, you need to execute the metadata pipeline again to apply the updated configuration. This execution can be done by:

Directly running the metadata pipeline.
Triggering the orchestrator via Sidra Web if schema evolution is enabled. Ensure you set "Do you want the metadata to be refreshed automatically with each data extraction?" to true.

Configuration Steps¶

To configure GenerateDeltaTable, follow these steps:

Set GenerateDeltaTable on the Entity to true.
Update the Entity's LastUpdated timestamp to be later than its LastDeployed timestamp.
Execute the orchestrator for the Data Intake Process (DIP) or the FileIngestionDatabricks ADF pipeline for the Provider.
After the pipeline execution, verify changes in the Databricks table history by running:

-- Run this in Databricks SQL notebook
%sql DESCRIBE HISTORY [Provider].[table]

To verify tracking, modify, add, or delete data at the source, then query the tracked changes:

-- Timestamp indicates the starting point for reviewing changes
%sql SELECT * FROM table_changes('[provider].[table]', '2022-04-21 05:45:46')

In the example above, the Delta table view shows three inserts, one update, and one deletion.

Databricks saves tracked changes in a folder (_change_data) within the same directory as the Delta table, storing them as Parquet files.

Interaction with Other Entity Parameters¶

The results displayed above are produced by enabling delta.enableChangeDataFeed = true on the Databricks table.

Warning

Do not confuse this configuration with the Consolidation Mode options in Data Products, which are set through the pipelineExecutionProperties column.

Consolidation Mode: Snapshot¶

In Snapshot mode, every ingestion appends all data, marking each record as an insert. Deleted records are also inserted as new rows, but with all columns set to NULL, except Id and SidraIsDeleted.

Consolidation Mode: Merge¶

In Merge mode:

Inserts appear as single-row operations.
Updates produce two rows each, representing the original and updated data.
Deletions appear as inserts with all columns set to NULL, except the Id and the column SidraIsDeleted set to true.