Skip to content

GenerateDeltaTable Configuration

≥ 2022.R3

Sidra includes support for the Change Data Feed feature in Azure Databricks, enabling efficient tracking of row-level changes in Delta tables. You can enable this feature through the GenerateDeltaTable property in the Entity metadata table.

The following table illustrates how Databricks records different change operations:

Change Type Result in Delta Table
Fields Original columns + _change_type, _commit_version, _commit_timestamp
Insertions New row added, _change_type = insert
Updates Two rows added per update: original data (_change_type = update_preimage) and updated data (_change_type = update_postimage), both sharing _commit_version and _commit_timestamp
Deletions Row added with deleted values and _change_type = delete

Configuration

Considerations

To modify the GenerateDeltaTable property, initial metadata must exist for the Entity. After the first metadata ingestion, you need to execute the metadata pipeline again to apply the updated configuration. This execution can be done by:

  • Directly running the metadata pipeline.
  • Triggering the orchestrator via Sidra Web if schema evolution is enabled. Ensure you set "Do you want the metadata to be refreshed automatically with each data extraction?" to true.

Configuration Steps

To configure GenerateDeltaTable, follow these steps:

  1. Set GenerateDeltaTable on the Entity to true.
  2. Update the Entity's LastUpdated timestamp to be later than its LastDeployed timestamp.
  3. Execute the orchestrator for the Data Intake Process (DIP) or the FileIngestionDatabricks ADF pipeline for the Provider.
  4. After the pipeline execution, verify changes in the Databricks table history by running:
-- Run this in Databricks SQL notebook
%sql DESCRIBE HISTORY [Provider].[table]

Change Data Feed Enabled

To verify tracking, modify, add, or delete data at the source, then query the tracked changes:

-- Timestamp indicates the starting point for reviewing changes
%sql SELECT * FROM table_changes('[provider].[table]', '2022-04-21 05:45:46')

Tracked Changes

In the example above, the Delta table view shows three inserts, one update, and one deletion.

Databricks saves tracked changes in a folder (_change_data) within the same directory as the Delta table, storing them as Parquet files.

Interaction with Other Entity Parameters

The results displayed above are produced by enabling delta.enableChangeDataFeed = true on the Databricks table.

Warning

Do not confuse this configuration with the Consolidation Mode options in Data Products, which are set through the pipelineExecutionProperties column.

Consolidation Mode: Snapshot

In Snapshot mode, every ingestion appends all data, marking each record as an insert. Deleted records are also inserted as new rows, but with all columns set to NULL, except Id and SidraIsDeleted.

Snapshot Mode Example

Consolidation Mode: Merge

In Merge mode:

  • Inserts appear as single-row operations.
  • Updates produce two rows each, representing the original and updated data.
  • Deletions appear as inserts with all columns set to NULL, except the Id and the column SidraIsDeleted set to true.

Merge Mode Example