How encrypted data ingestion works¶
Data ingestion in Sidra produces two copies of the information:
- The raw copy of the asset, which is stored in an Azure Storage container.
- An optimized copy of the information, which is loaded into Databricks.
When it comes to the encryption of information:
- The raw copy is fully encrypted: the entire file is encrypted and stored in the Azure Storage container.
- The information loaded into Databricks is partially encrypted: only the attributes marked with the IsEncrypted flag are encrypted.
How encryption is configured in the platform¶
The information used in the encryption process is provided to the platform:
- In the Sidra Core deployment. Some parameters must be configured in the Azure DevOps release pipeline in order to use encryption.
- In the metadata database. Additional information must be configured in some tables of the metadata database to identify the content that must be encrypted.
Azure DevOps configuration¶
To enable any kind of encryption, the following variables must be created and populated:
- EncryptionKey: a randomly generated string of 16, 24 or 32 bytes.
- EncryptionInitializationVector: a randomly generated string of 16 bytes.
Both variables must be created in an Azure DevOps library. They are used during the Release pipeline by the script that configures Databricks, which adds them as Databricks secrets.
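Values of the required sizes can be produced with Python's standard `secrets` module. This is only a sketch: Sidra does not prescribe a generation method, and the helper name `generate_secret` is illustrative.

```python
import secrets
import string


def generate_secret(length: int) -> str:
    """Generate a random ASCII string of exactly `length` bytes."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


# EncryptionKey must be 16, 24 or 32 bytes long; the IV is always 16 bytes.
encryption_key = generate_secret(32)
encryption_iv = generate_secret(16)

print(len(encryption_key.encode("utf-8")))  # 32
print(len(encryption_iv.encode("utf-8")))   # 16
```

Using ASCII characters guarantees that the byte length of the string equals its character length, which is what the size requirements refer to.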
Metadata database configuration¶
The information about which assets, entities and attributes are encrypted is included in the metadata database along with the rest of the metadata of those elements.
The Asset table contains an IsEncrypted field that indicates whether the asset is encrypted. The field is managed by the platform (it is not meant to be configured by users) and is updated by the data ingestion pipeline once the encryption of the raw copy is completed. It is also used to know whether the raw copy must be decrypted when reloading the file (i.e. re-ingesting it) or extracting it.
The Entity table contains an AdditionalProperties field, which stores a JSON structure used to add extra information about the entity. Users can configure encryption for the assets associated with the entity by including the corresponding encryption flag in this JSON.
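As an illustration, the JSON stored in AdditionalProperties might look like the following. The exact property name is not shown in this excerpt, so `isEncrypted` here is an assumption:

```python
import json

# Hypothetical content of Entity.AdditionalProperties; the real property
# name used by Sidra may differ ("isEncrypted" is an assumption).
additional_properties = {"isEncrypted": True}

# The value stored in the AdditionalProperties column is plain JSON text.
print(json.dumps(additional_properties))  # {"isEncrypted": true}
```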
This information is used by the data ingestion pipeline to know if an asset must be encrypted.
The Attribute table contains an IsEncrypted field that users can set to identify which attributes must be encrypted in Databricks. This information is used by the GenerateTransferQuery custom activity to create a transfer query that encrypts those attributes.
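The role of the attribute flags can be sketched as a query generator: attributes with IsEncrypted set are wrapped in an encryption expression, while the rest are selected as-is. This is a simplification of what GenerateTransferQuery produces; `encrypt_expr` and the table/attribute names are stand-ins, not the platform's actual expressions.

```python
from typing import List, Tuple


def build_transfer_query(table: str, attributes: List[Tuple[str, bool]]) -> str:
    """Build a simplified Spark SQL SELECT that encrypts flagged attributes.

    `attributes` is a list of (name, is_encrypted) pairs; encrypt_expr is a
    placeholder for the encryption expression used by the real transfer query.
    """
    columns = [
        f"encrypt_expr({name}) AS {name}" if is_encrypted else name
        for name, is_encrypted in attributes
    ]
    return f"SELECT {', '.join(columns)} FROM {table}"


query = build_transfer_query("staging_asset", [("id", False), ("email", True)])
print(query)  # SELECT id, encrypt_expr(email) AS email FROM staging_asset
```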
The entity that requires encryption must be associated with an encryption ingestion pipeline: the EntityPipeline table must contain the association between the entity and the FileIngestionWithEncryption pipeline instead of the usual file ingestion pipeline.
How it works¶
The encryption is performed by the FileIngestionWithEncryption pipeline, which is launched after the raw copy of the asset is stored in the Data Storage Unit (DSU). At this point the raw copy is not encrypted yet. More information about where that raw copy comes from can be found in the Overview section.
The encryption data ingestion pipeline follows these steps:
Use transfer query to load asset in Databricks¶
The plain (unencrypted) raw copy is used to load the information into Databricks using the transfer query script. More information about how the transfer query works can be found in this section.
The transfer query is generated specifically for the entity, based on its metadata. It is produced by the GenerateTransferQuery custom activity, which checks the IsEncrypted field of the attributes to generate a Spark SQL query that loads those attributes encrypted into Databricks. Once generated, the transfer query is executed using a Notebook activity in the pipeline.
Encrypt plain raw copy¶
While the transfer query is generated and executed, an encrypted version of the raw copy of the asset is created. The raw copy is encrypted by a Python Notebook activity that executes the FileEncryption.py notebook, which is included in Sidra and deployed in the Shared folder in Databricks. This notebook uses the Databricks secrets configured by the DevOps Release pipeline.
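This excerpt does not state which cipher FileEncryption.py uses, but the 16/24/32-byte key and 16-byte IV match AES in CBC mode. The following is a minimal sketch under that assumption, using the third-party `cryptography` package; the function name and hard-coded values are illustrative only.

```python
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes


def encrypt_file_bytes(data: bytes, key: bytes, iv: bytes) -> bytes:
    """Encrypt raw file content with AES-CBC (assumed cipher) and PKCS7 padding."""
    padder = padding.PKCS7(algorithms.AES.block_size).padder()
    padded = padder.update(data) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return encryptor.update(padded) + encryptor.finalize()


# In Databricks the key and IV would come from the secrets configured by the
# Release pipeline (dbutils.secrets); they are hard-coded here only to
# illustrate the required sizes.
key = b"0123456789abcdef0123456789abcdef"  # 32 bytes
iv = b"abcdef9876543210"                   # 16 bytes
ciphertext = encrypt_file_bytes(b"raw asset content", key, iv)
```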
Clean up after the process finishes¶
When both the load into Databricks and the creation of the encrypted raw copy have finished, the following actions are performed:
- The plain raw copy is deleted.
- The encrypted raw copy is renamed with the original filename of the plain raw copy.
- The asset metadata is marked as encrypted by activating the IsEncrypted flag in the Asset table.
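The cleanup steps can be sketched with local file operations. This is a simplification: in the platform these operations run against the Azure Storage container, the filenames are illustrative, and the flag update is a write to the metadata database rather than a file operation.

```python
import tempfile
from pathlib import Path


def finish_encryption(plain: Path, encrypted: Path) -> Path:
    """Delete the plain raw copy and give the encrypted copy its filename."""
    original_name = plain.name
    plain.unlink()                                        # delete the plain raw copy
    return encrypted.rename(encrypted.with_name(original_name))
    # in the platform, Asset.IsEncrypted would now be set to true


workdir = Path(tempfile.mkdtemp())
plain = workdir / "asset.csv"           # hypothetical raw copy name
encrypted = workdir / "asset.csv.enc"   # hypothetical encrypted copy name
plain.write_bytes(b"plain")
encrypted.write_bytes(b"cipher")

final = finish_encryption(plain, encrypted)
print(final.name)  # asset.csv
```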