How encrypted data ingestion works

The data intake process in Sidra produces two different copies of the Asset data:

  • The raw copy of the Asset, which is stored in an Azure Storage container.
  • The optimized copy of the data, which is loaded into Databricks (using Azure Data Lake Storage as the storage layer).

When it comes to the encryption of information, the following principles apply in Sidra:

  • The raw copy is encrypted in full: the entire file is encrypted and stored in the Azure Storage container.
  • The data loaded into Databricks is partially encrypted: only the Attributes marked with the IsEncrypted flag are encrypted.

This section builds upon the general concepts of the Sidra data ingestion process; for more details, see the Data ingestion process documentation.

How encryption is configured in the platform

The encryption process at intake time uses some information that is configured in the platform:

  1. In the Sidra Core deployment. Some parameters must be configured in the Azure DevOps release pipeline in order to use encryption.
  2. In the metadata database. Additional information must be configured in some tables of the Sidra Core metadata database to identify the content that must be encrypted.

Azure DevOps configuration

To enable any kind of encryption, it is necessary to create and populate the following variables:

  • EncryptionKey: a randomly generated string, 16, 24 or 32 bytes long.
  • EncryptionInitializationVector: a randomly generated string, 16 bytes long.

Both variables must be created in an Azure DevOps library. They are used during the Release pipeline by the script that configures the Databricks cluster. This cluster configuration script adds the values of EncryptionKey and EncryptionInitializationVector as the Databricks secrets key and initialization_vector, respectively.
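As a hypothetical illustration (this snippet is not part of Sidra), suitable values can be generated with a few lines of Python; the lengths follow the requirements above:

# Minimal sketch: generate random ASCII strings of the lengths Sidra expects
# (16, 24 or 32 bytes for the key; 16 bytes for the initialization vector).
import secrets
import string

ALPHABET = string.ascii_letters + string.digits

def random_string(length: int) -> str:
    # Each character is one byte in ASCII, so len(result) == byte length.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

encryption_key = random_string(32)                    # EncryptionKey
encryption_initialization_vector = random_string(16)  # EncryptionInitializationVector

print(encryption_key)
print(encryption_initialization_vector)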

Metadata configuration

The information about what Assets, Entities and Attributes need to be encrypted is configured in the Sidra Core metadata database, along with the rest of metadata for those elements.

Assets

The Asset table contains a field IsEncrypted that specifies whether the Asset is encrypted. This field is managed by the platform and is not meant to be configured by users.

The IsEncrypted field will be updated by the data ingestion pipeline after the encryption of the raw copy of the Asset is completed.

Additionally, the same field is used to determine whether the raw copy must be decrypted when the file is reloaded (i.e. re-ingested) or extracted.

Entities

The Entity table contains a field AdditionalProperties, which stores a JSON structure used to add extra information about the Entity.

This field can be used to configure encryption for the Assets associated with the Entity.

This is done by including the following field in the JSON structure:

{
    "isEncrypted": true
}

This information is used by the data ingestion pipeline to determine whether an Asset must be encrypted.
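As an illustrative sketch only (the actual check is internal to the pipeline), this is the kind of lookup a component could perform on that value:

import json

# Value as read from the AdditionalProperties column of the Entity table.
additional_properties = '{"isEncrypted": true}'

properties = json.loads(additional_properties or "{}")
must_encrypt = properties.get("isEncrypted", False)

print(must_encrypt)  # True -> the associated Assets must be encrypted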

Attributes

The Attribute table contains a field IsEncrypted that users can set to identify which Attributes of the Entity associated with the Asset must be encrypted in Databricks.

This information is used by the GenerateTransferQuery ADF activity to create a transfer query script that encrypts those Attributes.

Encryption pipeline

The Entity that requires encryption must be associated with an encryption ingestion pipeline.

This means that the EntityPipeline table must contain an association between the Entity and the FileIngestionWithEncryption pipeline, instead of the usual FileIngestionDatabricks pipeline.
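As a purely hypothetical sketch (the table layout, column names and connection handling are assumptions; check the actual Sidra Core metadata database schema), the association could be created along these lines:

import os
import pyodbc

# Connection string to the Sidra Core metadata database, assumed to be
# available as an environment variable here.
connection = pyodbc.connect(os.environ["SIDRA_METADATA_CONNECTION_STRING"])
cursor = connection.cursor()
cursor.execute(
    """
    INSERT INTO EntityPipeline (IdEntity, IdPipeline)
    SELECT e.Id, p.Id
    FROM Entity e
    CROSS JOIN Pipeline p
    WHERE e.Name = ? AND p.Name = 'FileIngestionWithEncryption'
    """,
    "ExampleEntity",
)
connection.commit()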

How it works

The encryption is performed by the FileIngestionWithEncryption pipeline, which is launched after the raw copy of the Asset is stored in the Data Storage Unit (DSU). At this point, the raw copy is not encrypted yet. More information about where that raw copy comes from can be found in the Overview section.

The encryption data ingestion pipeline follows these steps:

Use transfer query to load asset in Databricks

The plain (unencrypted) raw copy is used to load the information into Databricks using the transfer query script. More information about how the transfer query works can be found in this section.

The transfer query script is auto-generated specifically for the Entity based on its metadata information.

The transfer query is generated by the GenerateTransferQuery activity in the ADF pipeline, which checks the IsEncrypted field of the Attributes to generate a Spark SQL query to load the data into the DSU.

This Spark SQL query loads those Attributes into Databricks already encrypted.

After the generation of the transfer query script, this script is executed using a Notebook activity in the pipeline.
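To make this concrete, here is a hand-written, hypothetical sketch of the kind of statement a generated transfer query could contain, for an Entity where only the Email Attribute has IsEncrypted set. It assumes a Databricks runtime whose Spark SQL provides the built-in aes_encrypt function (with an explicit IV argument, Spark 3.4+); the secret scope and table names are illustrative, and the script Sidra actually generates may differ:

# Runs inside a Databricks notebook, where `spark` and `dbutils` are predefined.
# The secret scope ("encryption") and the table names are hypothetical.
key = dbutils.secrets.get(scope="encryption", key="key")
iv = dbutils.secrets.get(scope="encryption", key="initialization_vector")

spark.sql(f"""
    INSERT INTO provider_database.example_entity
    SELECT
        Id,
        Name,
        -- Only the Attribute flagged with IsEncrypted is transformed.
        base64(aes_encrypt(Email, '{key}', 'CBC', 'PKCS', cast('{iv}' AS BINARY))) AS Email
    FROM staging_example_entity
""")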

Encrypt plain raw copy

In parallel with the generation and execution of the transfer query script, an encrypted version of the raw copy of the Asset is created.

To encrypt the raw copy, a Python Notebook activity is used, which in turn executes the FileEncryption.py notebook included in Sidra Core as part of the Sidra deployment. This Python file is located in the Shared folder inside the Databricks workspace.

The FileEncryption.py notebook uses the Databricks secrets configured by the DevOps Release pipeline.
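The notebook itself is not reproduced here; the following is a rough sketch, under stated assumptions (AES in CBC mode via the pycryptodome package, a hypothetical secret scope named encryption, and illustrative file paths), of what whole-file encryption with those secrets can look like:

# Illustrative sketch of whole-file encryption in a Databricks notebook, where
# `dbutils` is predefined. Requires pycryptodome; scope and paths are hypothetical.
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad

key = dbutils.secrets.get(scope="encryption", key="key").encode("utf-8")
iv = dbutils.secrets.get(scope="encryption", key="initialization_vector").encode("utf-8")

source_path = "/dbfs/mnt/raw/example-asset.csv"            # plain raw copy
target_path = "/dbfs/mnt/raw/example-asset.csv.encrypted"  # encrypted output

with open(source_path, "rb") as source:
    plaintext = source.read()

cipher = AES.new(key, AES.MODE_CBC, iv)
with open(target_path, "wb") as target:
    target.write(cipher.encrypt(pad(plaintext, AES.block_size)))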

Clean up after the process finishes

Once the data has been loaded into Databricks and the encrypted raw copy has been created, the following actions are performed (a brief sketch follows the list):

  1. The plain raw copy is deleted.

  2. The encrypted raw copy is renamed with the original filename of the plain raw copy.

  3. The Asset metadata is marked as encrypted by activating the IsEncrypted flag in the Asset table.
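As a rough sketch (paths are hypothetical, and the metadata update is reduced to a comment), these three actions could look like this in a Databricks notebook:

# Illustrative only; `dbutils` is predefined in a Databricks notebook.
plain_path = "dbfs:/mnt/raw/example-asset.csv"
encrypted_path = "dbfs:/mnt/raw/example-asset.csv.encrypted"

dbutils.fs.rm(plain_path)                  # 1. delete the plain raw copy
dbutils.fs.mv(encrypted_path, plain_path)  # 2. give the encrypted copy the original name

# 3. The pipeline then sets IsEncrypted = 1 for the Asset in the metadata database.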