How encrypted data ingestion works

The data ingestion in Sidra comprises two copies of the information:

  • The raw copy of the asset that is stored in an Azure Storage container.
  • The optimized storage of the information that is loaded into Databricks.

When it comes to the encryption of information:

  • The raw copy will be fully encrypted. The entire file is encrypted and stored in the Azure Storage container.
  • The information loaded into Databricks will be partially encrypted: only the attributes marked with the IsEncrypted flag are encrypted.

How encryption is configured in the platform

The information used in the encryption process is provided to the platform in two places:

  1. In the Sidra Core deployment. Some parameters are required to be configured in the Azure DevOps release pipeline in order to use encryption.
  2. In the metadata database. Additional information must be configured in some tables of the metadata database to identify the content that must be encrypted.

Azure DevOps configuration

To enable any kind of encryption, the following variables must be created and populated with values:

  • EncryptionKey is a randomly generated string that is 16, 24 or 32 bytes long.
  • EncryptionInitializationVector is a randomly generated string that is 16 bytes long.

Both variables must be created in an Azure DevOps library. During the Release pipeline, the script that configures Databricks reads them and adds them as the Databricks secrets key and initialization_vector, respectively.
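As an illustration, strings of the required lengths could be generated with Python's standard secrets module. This is only a sketch: the source does not say how the values should be generated or which characters are allowed, so the hexadecimal alphabet, the helper names random_ascii_string and validate_encryption_settings, and the choice of a 32-byte key are all assumptions.

```python
import secrets

ALLOWED_KEY_LENGTHS = (16, 24, 32)  # lengths listed for EncryptionKey
IV_LENGTH = 16                      # length listed for EncryptionInitializationVector


def random_ascii_string(length: int) -> str:
    """Return a random hex string whose UTF-8 encoding is exactly `length` bytes."""
    if length <= 0 or length % 2 != 0:
        raise ValueError("length must be a positive even number")
    # token_hex(n) returns 2 * n hexadecimal characters, one byte each in UTF-8.
    return secrets.token_hex(length // 2)


def validate_encryption_settings(key: str, iv: str) -> None:
    """Raise ValueError if the values do not match the documented lengths."""
    if len(key.encode("utf-8")) not in ALLOWED_KEY_LENGTHS:
        raise ValueError("EncryptionKey must be 16, 24 or 32 bytes long")
    if len(iv.encode("utf-8")) != IV_LENGTH:
        raise ValueError("EncryptionInitializationVector must be 16 bytes long")


# Hypothetical values to paste into the Azure DevOps library variables.
encryption_key = random_ascii_string(32)
encryption_iv = random_ascii_string(16)
validate_encryption_settings(encryption_key, encryption_iv)
```

Hexadecimal characters keep the value printable and easy to paste into a variable, at the cost of carrying only 4 bits of randomness per byte. If the deployment accepts a wider character set, a different generator may be preferable.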

Metadata configuration

The information about which assets, entities and attributes are encrypted is stored in the metadata database along with the rest of the metadata for those elements.

Assets

The Asset table contains a field IsEncrypted that indicates whether the asset is encrypted. The field is managed by the platform (it is not meant to be configured by users) and is updated by the data ingestion pipeline after the encryption of the raw copy is completed. It is also used to determine whether the raw copy must be decrypted when reloading the file (i.e. re-ingesting it) or extracting it.

Entities

The Entity table contains a field AdditionalProperties, which stores a JSON structure with additional information about the entity. Users can use this field to configure encryption for the assets associated with the entity by including the following information in the field:

{
    "isEncrypted": true
}

This information is used by the data ingestion pipeline to know if an asset must be encrypted.
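A minimal sketch of how such a check could look is shown here. The source does not describe how the pipeline actually parses the field, so the function name entity_requires_encryption and the decision to treat a missing, empty or malformed value as "not encrypted" are assumptions made for illustration.

```python
import json
from typing import Optional


def entity_requires_encryption(additional_properties: Optional[str]) -> bool:
    """Return True only when the JSON text contains "isEncrypted": true."""
    if not additional_properties:
        # Assumption: an empty or NULL field means no encryption was requested.
        return False
    try:
        properties = json.loads(additional_properties)
    except json.JSONDecodeError:
        # Assumption: malformed JSON is treated as "not encrypted".
        return False
    return isinstance(properties, dict) and properties.get("isEncrypted") is True


# Example values as they might appear in Entity.AdditionalProperties.
flagged = entity_requires_encryption('{"isEncrypted": true}')
unflagged = entity_requires_encryption('{"someOtherSetting": 1}')
empty = entity_requires_encryption(None)
```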

Attributes

The Attribute table contains a field IsEncrypted that users can use to identify which attributes must be encrypted in Databricks.

This information is used by the GenerateTransferQuery custom activity to create a transfer query that encrypts those attributes.

Encryption pipeline

An entity that requires encryption must be associated with an encryption ingestion pipeline. That means the EntityPipeline table must contain the association between the entity and the FileIngestionWithEncryption pipeline, instead of the usual FileIngestionDatabricks pipeline.

How it works

The encryption is performed by the FileIngestionWithEncryption pipeline, which is launched after the raw copy of the asset has been stored in the Data Storage Unit (DSU). At that point the raw copy is not yet encrypted. More information about where that raw copy comes from can be found in the Overview section.

The encryption data ingestion pipeline follows these steps:

Use transfer query to load asset in Databricks

The plain (unencrypted) raw copy is used to load the information into Databricks using the transfer query script. More information about how the transfer query works can be found in this section.

The transfer query is generated specifically for the entity and it is based on the metadata information. The transfer query is generated by the GenerateTransferQuery custom activity that checks the IsEncrypted field of the attributes to generate a Spark SQL query that loads those attributes encrypted into Databricks. After the generation of the transfer query, it is executed using a Notebook activity in the pipeline.
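The idea of wrapping only the flagged attributes can be sketched as follows. This is not the query that GenerateTransferQuery produces; the Attribute class, the build_select helper, the staging_customer table and the encrypt_value SQL function name are hypothetical placeholders used to show the principle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attribute:
    name: str
    is_encrypted: bool  # mirrors the IsEncrypted field of the Attribute table


def build_select(attributes: List[Attribute], source_table: str,
                 encrypt_function: str = "encrypt_value") -> str:
    """Build a SELECT that wraps only the flagged columns in an encryption call.

    `encrypt_function` is a hypothetical SQL function name, not a Sidra API.
    """
    columns = []
    for attribute in attributes:
        column = f"`{attribute.name}`"
        if attribute.is_encrypted:
            columns.append(f"{encrypt_function}({column}) AS {column}")
        else:
            columns.append(column)
    return f"SELECT {', '.join(columns)} FROM {source_table}"


query = build_select(
    [Attribute("Id", False), Attribute("Email", True)],
    source_table="staging_customer",  # hypothetical table name
)
print(query)
```

Columns without the flag pass through unchanged, which matches the partial encryption described for the Databricks copy.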

Encrypt plain raw copy

While the transfer query is generated and executed, an encrypted version of the raw copy of the asset is created. The raw copy is encrypted by a Python Notebook activity that executes the FileEncryption.py notebook, which is included in Sidra and deployed in the Shared folder in Databricks. This notebook uses the Databricks secrets configured by the DevOps Release pipeline.

Clean up after the process finishes

When both the load into Databricks and the creation of the encrypted raw copy have finished, the following actions are performed:

  1. The plain raw copy is deleted.

  2. The encrypted raw copy is renamed with the original filename of the plain raw copy.

  3. The asset metadata is marked as encrypted by activating the IsEncrypted flag in the Asset table.
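The three clean-up actions can be sketched with local files standing in for the raw copies. In the platform the files live in Azure Storage and the flag lives in the metadata database, so the finalize_encrypted_asset helper, the .enc suffix and the dictionary used as an asset record are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def finalize_encrypted_asset(plain_path: Path, encrypted_path: Path, asset: dict) -> None:
    """Apply the clean-up steps to local stand-ins for the raw copies."""
    # 1. Delete the plain raw copy.
    plain_path.unlink()
    # 2. Rename the encrypted raw copy to the original filename.
    encrypted_path.replace(plain_path)
    # 3. Mark the asset as encrypted (stand-in for the Asset table update).
    asset["IsEncrypted"] = True


with tempfile.TemporaryDirectory() as folder:
    plain = Path(folder) / "customers.csv"
    encrypted = Path(folder) / "customers.csv.enc"  # hypothetical temporary name
    plain.write_bytes(b"id,email\n1,a@example.com\n")
    encrypted.write_bytes(b"ciphertext-placeholder")
    asset = {"Name": "customers.csv", "IsEncrypted": False}

    finalize_encrypted_asset(plain, encrypted, asset)

    remaining_files = sorted(path.name for path in Path(folder).iterdir())
    final_content = plain.read_bytes()
```

After the call, only one file remains under the original name and it holds the encrypted content, which is why the IsEncrypted flag is needed to know that a later reload or extraction must decrypt it first.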