GenerateDeltaTable Configuration¶
Sidra includes support for the Change Data Feed feature in Azure Databricks, enabling efficient tracking of row-level changes in Delta tables. You can enable this feature through the GenerateDeltaTable
property in the Entity metadata table.
The following table illustrates how Databricks records different change operations:
Change Type | Result in Delta Table |
---|---|
Fields | Original columns + _change_type , _commit_version , _commit_timestamp |
Insertions | New row added, _change_type = insert |
Updates | Two rows added per update: original data (_change_type = update_preimage ) and updated data (_change_type = update_postimage ), both sharing _commit_version and _commit_timestamp |
Deletions | Row added with deleted values and _change_type = delete |
Configuration¶
Considerations¶
To modify the GenerateDeltaTable
property, initial metadata must exist for the Entity. After the first metadata ingestion, you need to execute the metadata pipeline again to apply the updated configuration. This execution can be done by:
- Directly running the metadata pipeline.
- Triggering the orchestrator via Sidra Web if schema evolution is enabled. Ensure you set "Do you want the metadata to be refreshed automatically with each data extraction?" to
true
.
Configuration Steps¶
To configure GenerateDeltaTable
, follow these steps:
- Set
GenerateDeltaTable
on the Entity totrue
. - Update the Entity's
LastUpdated
timestamp to be later than itsLastDeployed
timestamp. - Execute the orchestrator for the Data Intake Process (DIP) or the
FileIngestionDatabricks
ADF pipeline for the Provider. - After the pipeline execution, verify changes in the Databricks table history by running:
To verify tracking, modify, add, or delete data at the source, then query the tracked changes:
-- Timestamp indicates the starting point for reviewing changes
%sql SELECT * FROM table_changes('[provider].[table]', '2022-04-21 05:45:46')
In the example above, the Delta table view shows three inserts, one update, and one deletion.
Databricks saves tracked changes in a folder (_change_data
) within the same directory as the Delta table, storing them as Parquet files.
Interaction with Other Entity Parameters¶
The results displayed above are produced by enabling delta.enableChangeDataFeed = true
on the Databricks table.
Warning
Do not confuse this configuration with the Consolidation Mode options in Data Products, which are set through the pipelineExecutionProperties
column.
Consolidation Mode: Snapshot¶
In Snapshot mode, every ingestion appends all data, marking each record as an insert. Deleted records are also inserted as new rows, but with all columns set to NULL
, except Id
and SidraIsDeleted
.
Consolidation Mode: Merge¶
In Merge mode:
- Inserts appear as single-row operations.
- Updates produce two rows each, representing the original and updated data.
- Deletions appear as inserts with all columns set to
NULL
, except theId
and the columnSidraIsDeleted
set totrue
.