
How to alter Entity's Attributes

When a source table that is ingested by Sidra is altered, its corresponding metadata Entity needs to be updated. By table alteration we mean a definition change: columns are added, changed, or dropped in the source table. Without the metadata update, the ingested target table – in the Data Lake – may not reflect the source table, or the ingestion may simply stop due to errors.

Information

Currently, the Schema Evolution feature of Sidra only covers new column additions in the source tables.

Introduction

Sidra supports automatic metadata update only when new columns are added in the source table, and only when the Data Intake Process has the Schema Evolution feature enabled.

For other source table alterations – such as column definition changes or column deletions – the metadata must be changed manually in Sidra’s [Core] database and in the target Databricks table in the Data Lake. The same requirement applies when new columns are added in the source table but the Data Intake Process does not have the Schema Evolution feature enabled.

This document guides you through the steps that must be completed.

Steps to perform

In short, this is the sequence of steps to take:

1. Stop ingestion: Make sure that no pipeline which could ingest the source table is running.
2. Backup target table: Optionally create a clone of the target Data Lake table in Databricks.
3. Update Attributes: Edit the list of the Entity’s Attributes to reflect the new definition of the ingested source table.
4. Flag the changes: Edit the Entity’s record to let Sidra “know” that the Entity has changed.
   • The Last Updated field must be changed;
   • Decide whether the target Data Lake table should be re-created.
5. Check incremental loads: Edit the Entity's Last Delta Loading value, if the table is incrementally ingested.
6. Optionally redeploy the Entity, or wait for the next execution of the ingestion pipeline.
7. Verify, in Databricks, that the target Data Lake table reflects the change.
8. Resume ingestion, if it was stopped in step 1.

More context for the steps above:

1. Stop ingestion

Make sure, in the Azure Data Factory of the DSU, that the ingestion pipeline of the corresponding Data Intake Process is not running. You may want to temporarily stop the trigger of that pipeline until these steps are completed.

2. Backup Databricks table

Source table definition changes may be extensive enough to call for a re-creation of the target table in Databricks – for instance, when columns are dropped in the source table.

If the table is re-created, the history of the Databricks table – its historical records – is lost. Therefore, if that history should be preserved, clone-copying the target Databricks table as a backup is necessary.
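For reference, a Delta table can be clone-copied with a single Databricks SQL statement, as in the sketch below; provider_database and entity_table are placeholders for the actual Provider database and Entity table names, and the backup table name is arbitrary.

    -- Sketch only: replace the placeholder names with the actual Provider
    -- database, table, and desired backup table names.
    CREATE TABLE provider_database.entity_table_backup
    DEEP CLONE provider_database.entity_table;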

Step 4 details how to set the flag that would trigger the re-creation of the target table in Databricks.

3. Entity’s Attributes

In Sidra’s [Core] database, identify the Entity record corresponding to the source table in the [DataIngestion].[Entity] table; the ID key of that Entity record will be needed. Then, list the [DataIngestion].[Attribute] records corresponding to the columns of the source table. Edit this list according to the changes in the source table: delete records if columns were dropped from the source table, or change/add records for columns that have been altered or added.

Keep in mind that the [Order] column is relevant and that the system/metadata attributes must come last when sorted by [Order].

Also, map the [SQLType], [HiveType], and [SourceType] fields correctly to the corresponding source column type. Other columns of [DataIngestion].[Attribute] may also be relevant, for data validation or for usage in Data Products; their names are self-explanatory.
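As an illustration, the following T-SQL sketch shows how the Entity and its Attributes could be located before editing them; the 'MySourceTable' filter value and the [IdEntity] foreign-key column name are assumptions to verify against your own [Core] database.

    -- Locate the Entity record for the source table and note its Id.
    SELECT [Id], [Name]
    FROM [DataIngestion].[Entity]
    WHERE [Name] = 'MySourceTable';      -- placeholder Entity name

    -- List the Entity's Attributes in their current order.
    SELECT *
    FROM [DataIngestion].[Attribute]
    WHERE [IdEntity] = 123               -- assumed FK column; placeholder Entity Id
    ORDER BY [Order];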

4. Entity’s record

In Sidra’s [Core] database, identify the [DataIngestion].[Entity] record corresponding to the source table. Edit the record:

  • In particular, the [LastUpdated] value should be more recent than [LastDeployed]. This signals to Sidra that the Entity may require changes in the ingestion process.
  • Also, the boolean field [ReCreateTableOnDeployment] indicates that the metadata changes are extensive enough to require a re-build of the target Data Lake table in Databricks. Setting [ReCreateTableOnDeployment] to true will drop and re-create the target Data Lake table; hence, the historical records of the table in Databricks will be lost.

Note: Before setting [ReCreateTableOnDeployment], evaluate whether the previously ingested data (the historical records) of the source table should be preserved, especially if columns are being dropped. The table clone backup mentioned in step 2 helps keep those historical records without interfering with Sidra's operation.
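A minimal T-SQL sketch of this edit is shown below; the Entity Id value, the [Id] key column, and the use of a UTC timestamp are assumptions to adapt to your environment.

    DECLARE @EntityId INT = 123;              -- placeholder: Id of the Entity from step 3

    UPDATE [DataIngestion].[Entity]
    SET [LastUpdated] = GETUTCDATE(),         -- must end up more recent than [LastDeployed]
        [ReCreateTableOnDeployment] = 1       -- only if the target Databricks table must be rebuilt
    WHERE [Id] = @EntityId;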

5. Last Delta Loading

Keep in mind the incremental load mechanisms in Sidra. If the target Data Lake table is being re-created, using the [ReCreateTableOnDeployment]=true flag, and the source table is incrementally loaded, then a full reload must be set up; a delta load would not bring back all the source records. In Sidra’s [Core] database, check whether there is a record for the Entity in the [DataIngestion].[EntityDeltaLoad] table.

  • If there is no record for the Entity in the table, the source table is not incrementally loaded; it is fully loaded with each ingestion. In that case, there is nothing further to do in this step.
  • If there is a record for the Entity in the table, signal to Sidra that, with the next ingestion, the table should be fully reloaded from the source:
    • Set its fields [LastDeltaValue] and [LastAuxiliaryDeltaValue] to NULL;
    • Set its fields [NeedReload] and [EnableReload] to true.
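A sketch of the corresponding update could look as follows; the Entity Id value and the [IdEntity] foreign-key column name are assumptions to verify in your [Core] database.

    DECLARE @EntityId INT = 123;              -- placeholder: Id of the Entity

    UPDATE [DataIngestion].[EntityDeltaLoad]
    SET [LastDeltaValue] = NULL,
        [LastAuxiliaryDeltaValue] = NULL,
        [NeedReload] = 1,
        [EnableReload] = 1
    WHERE [IdEntity] = @EntityId;             -- assumed FK column name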

6. Redeploy Entity

With the next ingestion pipeline execution for the Data Intake Process:

  • The target table will be re-created in Databricks, if the [ReCreateTableOnDeployment] field was set to true in the [DataIngestion].[Entity] table.
  • The table data will be read, ingested, and validated using the new table definition, according to the new Entity metadata.
  • The table will be ingested in full, with all its records from the source, if the Entity has no record in the [DataIngestion].[EntityDeltaLoad] table or if its fields were set as in step 5.

Alternatively, to only re-create the target table in Databricks, the PrepareDataLoad pipeline can be executed in the Data Factory of the DSU. If executing this pipeline, leave its transferQuery execution parameter null.

7. Verify

Once the next ingestion pipeline execution has taken place for the corresponding Data Intake Process, access the Data Lake catalogue and the Provider’s database to make sure that the target Databricks table has been re-created (if applicable) and that its new definition reflects the metadata changes performed.
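For a quick check from a Databricks notebook or SQL endpoint, something along these lines can be used; provider_database and entity_table are placeholders for the actual names.

    DESCRIBE TABLE provider_database.entity_table;     -- column list should match the updated Attributes
    DESCRIBE HISTORY provider_database.entity_table;   -- a re-created Delta table starts a fresh history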

8. Resume ingestion

If the Data Factory pipeline trigger was disabled in step 1, it is time to re-enable it.