How encrypted data ingestion works

Versions

  • This legacy encryption process has been kept for backward compatibility. However, with the support of native encryption in Databricks, this encryption feature in Sidra is no longer supported. A new feature may be added to the roadmap to leverage this native mechanism.
  • From version 2021.R1, an additional step has been added to ensure that the unencrypted raw copy is not stored in the same location as the encrypted raw copy: the unencrypted file is first moved to a temporary location and, once encrypted, it is moved to the raw zone.

Data ingestion in Sidra produces two copies of the information:

  • The raw copy of the Asset that is stored in an Azure Storage container.
  • An optimized copy of the information, which is loaded into Databricks.

When it comes to the encryption of information:

  • The raw copy is fully encrypted: the entire file is encrypted and stored in the Azure Storage container.
  • The information loaded into Databricks is partially encrypted: only the Attributes marked with the IsEncrypted flag are encrypted.

How encryption is configured in the platform

The information used in the encryption process is provided to the platform via two mechanisms:

  1. In the Sidra Core deployment: some parameters need to be configured in the Azure DevOps release pipeline in order to use encryption.
  2. In the metadata database: additional information must be configured in some tables of the metadata database to identify the content that must be encrypted.

Both mechanisms are detailed below:

Azure DevOps configuration

To enable any kind of encryption, the following variables must be created and populated:

  • EncryptionKey is a 16, 24 or 32-byte long string, randomly generated.
  • EncryptionInitializationVector is a 16-byte long string, randomly generated.

Both variables must be created inside an Azure DevOps library.

These variables are used during the release pipeline by the script that configures Databricks, which adds them as the Databricks secrets key and initialization_vector, respectively.
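
As an illustration only, the two values could be generated with a short script like the following before storing them in the Azure DevOps library. The 32-byte key length shown here is an assumption; 16 and 24 bytes are equally valid.

import secrets
import string

def random_string(length: int) -> str:
    # Random string of ASCII letters and digits (1 byte per character).
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

# EncryptionKey: 16, 24 or 32 bytes long (32 shown here).
encryption_key = random_string(32)

# EncryptionInitializationVector: 16 bytes long (the AES block size).
encryption_initialization_vector = random_string(16)

print(encryption_key)
print(encryption_initialization_vector)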

Metadata configuration

The information about which Assets, Entities and Attributes are encrypted is included in the metadata database, along with the rest of the metadata for those elements.

Entities

The Entity table contains a field, AdditionalProperties, which stores a JSON structure used to add extra information about the Entity.

Users can use this field to configure encryption settings for the Assets associated with the Entity.

The flag for enabling encryption is set through the following property in the JSON:

{
    "assetsEncrypted": true
}

This information is used by the data ingestion pipeline to know if an Asset must be encrypted.
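
Purely as an illustration (the actual check is performed by Sidra's ingestion pipelines), reading the flag from an AdditionalProperties value could look like this:

import json

# Value of Entity.AdditionalProperties as stored in the metadata database.
additional_properties = '{"assetsEncrypted": true}'

properties = json.loads(additional_properties or "{}")

# Default to no encryption when the flag is absent.
assets_encrypted = properties.get("assetsEncrypted", False)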

Attributes

The Attribute table contains an IsEncrypted field, which users can use to identify the Attributes that must be encrypted in Databricks.

This information is used in the legacy transfer query / DSU ingestion to encrypt those Attributes.

Encryption pipeline

An Entity that requires encryption must be associated with an encryption ingestion pipeline; that is, the EntityPipeline table must contain an association between the Entity and the FileIngestionWithEncryption pipeline, instead of the usual FileIngestionDatabricks pipeline.

How it works

From this version, the pre-existing FileIngestionWithEncryption pipeline is removed in favor of the FileIngestionDatabricks pipeline.

The following steps are followed:

  • Step 1: Move the file to a temporary zone, encrypt and move to raw zone

The RegisterAsset pipeline moves the file to a temporary zone; only once the file has been encrypted is it moved to the raw zone.

To encrypt the raw copy a Python Notebook activity is used, which executes the FileEncryption.py notebook included in Sidra and deployed in the Shared folder in Databricks. This notebook uses the Databricks secrets configured by the DevOps Release pipeline.
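
The FileEncryption.py notebook itself is part of Sidra and is not reproduced here. The sketch below is only a conceptual outline of whole-file AES encryption inside a Databricks notebook, assuming AES-CBC, a hypothetical secret scope named sidra and hypothetical file paths:

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

# Key and IV configured as Databricks secrets by the release pipeline.
# The secret scope name "sidra" is hypothetical.
key = dbutils.secrets.get(scope="sidra", key="key").encode("utf-8")
iv = dbutils.secrets.get(scope="sidra", key="initialization_vector").encode("utf-8")

source_path = "/tmp/landing/asset.csv"    # unencrypted temporary copy (hypothetical path)
target_path = "/tmp/encrypted/asset.csv"  # encrypted copy destined for the raw zone

with open(source_path, "rb") as source:
    plaintext = source.read()

# Pad to the AES block size and encrypt the whole file with AES-CBC.
padder = padding.PKCS7(algorithms.AES.block_size).padder()
padded = padder.update(plaintext) + padder.finalize()

encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
ciphertext = encryptor.update(padded) + encryptor.finalize()

with open(target_path, "wb") as target:
    target.write(ciphertext)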

  • Step 2: Use DSU ingestion to load the Asset into Databricks

    After this encryption has executed, which is required to move the file to the raw zone, the next stage of the ingestion process is the execution of the autogenerated transfer query / DSU ingestion, which copies the data to the DSU.

    The autogenerated transfer query / DSU ingestion decrypts the file in a temporary zone before ingesting the data into the DSU.

    The ExtractData pipeline calls the DSU ingestion script, which reads the raw copy of the Asset and inserts the information into the tables created by the CreateTables.py script. It also checks the IsEncrypted field of the Attributes to generate a Spark SQL query that loads those Attributes encrypted into Databricks.
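
The generated query is specific to each Entity and is not shown here. As a rough, hypothetical illustration of the idea, running inside a Databricks notebook, a selective encryption expression could be derived from the Attribute metadata like this, assuming a Spark runtime that provides the aes_encrypt SQL function and the same hypothetical secret scope as above:

# Attribute metadata as it might be read from the Attribute table
# (attribute names and table identifiers below are hypothetical).
attributes = [
    {"name": "customer_id", "is_encrypted": False},
    {"name": "email",       "is_encrypted": True},
    {"name": "amount",      "is_encrypted": False},
]

key = dbutils.secrets.get(scope="sidra", key="key")

select_parts = []
for attribute in attributes:
    if attribute["is_encrypted"]:
        # Only the flagged Attributes are encrypted; the rest pass through.
        select_parts.append(
            f"aes_encrypt(cast({attribute['name']} as binary), '{key}') as {attribute['name']}"
        )
    else:
        select_parts.append(attribute["name"])

query = f"select {', '.join(select_parts)} from staging_asset"
spark.sql(query).write.mode("append").saveAsTable("dsu.asset_table")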




Last update: 2022-09-30