How encrypted data ingestion works

Data ingestion in Sidra produces two copies of the information:

  • The raw copy of the Asset that is stored in an Azure Storage container.
  • The optimized storage of the information that is loaded into Databricks.

NOTE: From version 1.8.2, an additional step has been added to ensure that the unencrypted raw copy is never stored in the same location as the encrypted raw copy: the unencrypted file is first moved to a temporary location and, once encrypted, it is moved to the raw zone.

When it comes to the encryption of information:

  • The raw copy is fully encrypted: the entire file is encrypted and stored in the Azure Storage container.
  • The information loaded into Databricks is partially encrypted: only the Attributes marked with the IsEncrypted flag are encrypted.

How encryption is configured in the platform

The information used in the encryption process is provided to the platform via two mechanisms:

  1. In the Sidra Core deployment: some parameters need to be configured in the Azure DevOps release pipeline in order to use encryption.
  2. In the metadata database: additional information must be configured in some tables of the metadata database to identify the content that must be encrypted.

Both mechanisms are detailed below:

Azure DevOps configuration

In order to enable any kind of encryption it is necessary to create and populate the following variables:

  • EncryptionKey is a 16, 24 or 32-byte long string, randomly generated.
  • EncryptionInitializationVector is a 16-byte long string, randomly generated.

Both variables must be created inside an Azure DevOps library.

These variables will be used during the Release pipeline by the script that configures Databricks. The Databricks configuration script will add them as the following Databricks secrets: key and initialization_vector respectively.
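For example, suitable values could be produced with a short script like the one below (a sketch only; Sidra does not prescribe a particular generation method, and the printable alphabet is just a convenience for pasting the values into the library):

```python
import secrets
import string

# Printable characters so the generated values are 1 byte each and easy to
# paste into an Azure DevOps variable library.
ALPHABET = string.ascii_letters + string.digits

def random_string(length: int) -> str:
    """Return a cryptographically random ASCII string of exactly `length` bytes."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

encryption_key = random_string(32)                     # EncryptionKey: 16, 24 or 32 bytes
encryption_initialization_vector = random_string(16)   # EncryptionInitializationVector: 16 bytes

print(encryption_key)
print(encryption_initialization_vector)
```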

Metadata configuration

The information about which Assets, Entities and Attributes are encrypted is included in the metadata database along with the rest of the metadata for those elements.

Assets

**Since version 1.8.2, the Asset table no longer contains the field IsEncrypted.**

In previous versions of Sidra, the Asset table contains a field IsEncrypted that indicates whether the Asset is encrypted. The field is managed by the platform (not to be configured by users) and is updated by the data ingestion pipeline once the encryption of the raw copy is completed. It is also used to determine whether the raw copy must be decrypted when reloading the file (i.e. re-ingesting it) or extracting it.

Entities

The Entity table contains a field AdditionalProperties, which stores a JSON structure with additional information about the Entity.

Users can use this field to configure encryption settings for the Assets associated to the Entity.

The flag for enabling encryption is conveyed through the following line in the JSON:

```json
{
    "isEncrypted": true
}
```

This information is used by the data ingestion pipeline to know if an Asset must be encrypted.
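Conceptually, the check the pipeline performs on that field boils down to parsing the JSON, along the lines of this minimal sketch (illustrative only; the actual logic lives in the ADF pipeline):

```python
import json

def requires_encryption(additional_properties: str) -> bool:
    """Return True when the Entity's AdditionalProperties JSON sets isEncrypted to true."""
    if not additional_properties:
        return False
    return json.loads(additional_properties).get("isEncrypted", False) is True

# The value stored in Entity.AdditionalProperties:
print(requires_encryption('{"isEncrypted": true}'))  # True
print(requires_encryption('{}'))                     # False
```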

Attributes

The Attribute table contains a field IsEncrypted, which users can use to identify which Attributes must be encrypted in Databricks.

This information is used by the GenerateTransferQuery custom activity to create a transfer query that encrypts those Attributes.
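As an illustration of the idea (not the code actually emitted by GenerateTransferQuery), a query that encrypts only the flagged Attributes could be assembled as follows; the encrypt_attr UDF and the staging table name are hypothetical:

```python
# Hypothetical Attribute metadata read from the metadata database: (name, IsEncrypted).
attributes = [
    ("Id", False),
    ("Email", True),      # IsEncrypted = 1 -> stored encrypted in Databricks
    ("FullName", True),
    ("Country", False),
]

# Wrap flagged columns in the encryption UDF; leave the rest untouched.
select_list = ",\n    ".join(
    f"encrypt_attr({name}) AS {name}" if is_encrypted else name
    for name, is_encrypted in attributes
)

transfer_query = f"SELECT\n    {select_list}\nFROM staging_entity"
print(transfer_query)
```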

Encryption pipeline

The Entity that requires encryption must be associated with an encryption ingestion pipeline; that is, the EntityPipeline table must contain the association between the Entity and the FileIngestionWithEncryption pipeline instead of the usual FileIngestionDatabricks.
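A minimal sketch of creating that association directly in the metadata database is shown below; the connection string, identifiers and column names are assumptions to be checked against your own metadata schema:

```python
import pyodbc

# Assumption: standard SQL Server connection to the Sidra metadata database.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};Server=<server>;Database=<metadata-db>;"
    "Uid=<user>;Pwd=<password>"
)
cursor = conn.cursor()

id_entity = 42    # hypothetical Entity identifier
id_pipeline = 7   # hypothetical identifier of the FileIngestionWithEncryption pipeline

# Hypothetical column names modelling the Entity-to-pipeline association.
cursor.execute(
    "INSERT INTO EntityPipeline (IdEntity, IdPipeline) VALUES (?, ?)",
    id_entity,
    id_pipeline,
)
conn.commit()
```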

How it works

From version 1.8.2

The pre-existing FileIngestionWithEncryption pipeline is removed as of this version in favor of the FileIngestionDatabricks pipeline.

The following steps are followed:

  • Step 1: Move the file to a temporary zone, encrypt and move to raw zone

The RegisterAsset pipeline moves the file to a temporary zone and, only once the file has been encrypted, moves it to the raw zone.

To encrypt the raw copy a Python Notebook activity is used, which executes the FileEncryption.py notebook included in Sidra and deployed in the Shared folder in Databricks. This notebook uses the Databricks secrets configured by the DevOps Release pipeline.
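The contents of FileEncryption.py are not reproduced here, but an encryption step consistent with the key and IV sizes above could look like the following sketch (AES-CBC with PKCS7 padding and the secret scope name are assumptions, not confirmed Sidra internals):

```python
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_file(plain_path: str, encrypted_path: str, key: bytes, iv: bytes) -> None:
    """Encrypt a whole file with AES-CBC and PKCS7 padding."""
    with open(plain_path, "rb") as f:
        data = f.read()

    # Pad to the AES block size, then encrypt the whole payload.
    padder = padding.PKCS7(algorithms.AES.block_size).padder()
    padded = padder.update(data) + padder.finalize()

    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    ciphertext = encryptor.update(padded) + encryptor.finalize()

    with open(encrypted_path, "wb") as f:
        f.write(ciphertext)

# Inside Databricks, the key material would come from the secrets configured by
# the Release pipeline, e.g. (the scope name is an assumption):
# key = dbutils.secrets.get(scope="encryption", key="key").encode()
# iv = dbutils.secrets.get(scope="encryption", key="initialization_vector").encode()
```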

  • Step 2: Use transfer query to load the Asset into Databricks

    Once this encryption has executed (a prerequisite for moving the file to the raw zone), the next stage of the ingestion process is the execution of the transfer query, which copies the data to the DSU.

    The transfer query script decrypts the file in a temporary zone before ingesting the data into the DSU (a decryption sketch mirroring the encryption notebook above is shown after this step).

    The transfer query is generated specifically for the Entity and it is based on the metadata information. The transfer query is generated by the GenerateTransferQuery custom activity that checks the IsEncrypted field of the Attributes to generate a Spark SQL query that loads those Attributes encrypted into Databricks.

    After the generation of the transfer query script, its execution is done via a Notebook activity launched by the pipeline.
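That decryption in the temporary zone would be the mirror image of the encryption sketch above (the AES-CBC/PKCS7 choices remain assumptions, not confirmed Sidra code):

```python
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def decrypt_file(encrypted_path: str, plain_path: str, key: bytes, iv: bytes) -> None:
    """Decrypt an AES-CBC encrypted file and strip the PKCS7 padding."""
    with open(encrypted_path, "rb") as f:
        ciphertext = f.read()

    # Reverse the two encryption steps: decrypt, then remove the padding.
    decryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()

    unpadder = padding.PKCS7(algorithms.AES.block_size).unpadder()
    data = unpadder.update(padded) + unpadder.finalize()

    with open(plain_path, "wb") as f:
        f.write(data)
```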

Prior to version 1.8.2

Prior to version 1.8.2, the encryption is performed by the FileIngestionWithEncryption ADF pipeline, which is launched after the raw copy of the Asset is stored in the Data Storage Unit (DSU).

This raw copy is not encrypted yet. More information about where that raw copy comes from can be found in the Overview section.

The encryption data ingestion pipeline follows these steps:

  • Step 1: Use transfer query to load the Asset into Databricks:

    The plain (not encrypted) raw copy is used to load the information into Databricks using the transfer query script. More information about how the transfer query works can be found in this section.

    The transfer query is generated specifically for the Entity and it is based on the metadata information. The transfer query is generated by the GenerateTransferQuery custom activity that checks the IsEncrypted field of the Attributes to generate a Spark SQL query that loads those Attributes encrypted into Databricks.

    After the generation of the transfer query script, its execution is done via a Notebook activity launched by the pipeline.

  • Step 2: Encrypt plain raw copy:

    While the transfer query is generated and executed, an encrypted version of the raw copy of the Asset is created in the same raw zone. To encrypt the raw copy a Python Notebook activity is used, which executes the FileEncryption.py notebook included in Sidra and deployed in the Shared folder in Databricks.

    This notebook uses the Databricks secrets configured by the DevOps Release pipeline.

  • Step 3: Clean up after the process finishes:

    When the load into Databricks and the creation of the encrypted raw copy are finished, the following actions are performed:

    1. The plain raw copy is deleted.

    2. The encrypted raw copy is renamed with the original filename of the plain raw copy.

    3. The Asset metadata is marked as encrypted by activating the IsEncrypted flag in the Asset table.
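As a rough local-filesystem analogue of those three actions (the real pipeline operates on Azure Storage blobs and updates the metadata database; all names here are illustrative):

```python
from pathlib import Path
from typing import Callable

def finalize_encrypted_asset(plain_path: str, encrypted_path: str,
                             mark_encrypted: Callable[[], None]) -> None:
    """Delete the plain copy, rename the encrypted copy, and flag the Asset metadata."""
    plain = Path(plain_path)
    encrypted = Path(encrypted_path)

    original_name = plain.name
    plain.unlink()                                         # 1. delete the plain raw copy
    encrypted.rename(encrypted.with_name(original_name))   # 2. take over the original filename
    mark_encrypted()                                       # 3. set IsEncrypted on the Asset row
```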