How encrypted data ingestion works¶
The data intake process in Sidra produces two different copies of the Asset data:
- The raw copy of the Asset, which is stored in an Azure Storage container.
- The optimized copy of the data, which is loaded into Databricks (using Azure Data Lake Storage as the storage layer).
When it comes to the encryption of information, the following principles apply in Sidra:
- The raw copy is fully encrypted: the entire file is encrypted and stored in the Azure Storage container.
- The data loaded into Databricks is partially encrypted: only those attributes marked with the IsEncrypted flag are encrypted.
This section builds upon general Sidra data ingestion process concepts. For more details on those, see the Data ingestion process documentation.
How encryption is configured in the platform¶
The encryption process at intake time uses some information that is configured in the platform:
- In the Sidra Core deployment: some parameters must be configured in the Azure DevOps release pipeline in order to use encryption.
- In the metadata database: additional information must be configured in some tables of the Sidra Core metadata database to identify the content that must be encrypted.
Azure DevOps configuration¶
In order to enable any kind of encryption, it is necessary to create and populate the following variables:
- EncryptionKey: a randomly generated string of 16, 24, or 32 bytes.
- EncryptionInitializationVector: a randomly generated string of 16 bytes.
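As an illustration, such values could be generated with Python's secrets module (a sketch; any cryptographically secure random source that produces strings of the required length works):

```python
import secrets

# Hex encoding yields printable strings: 16 random bytes become a
# 32-character string usable as a 32-byte key, and 8 random bytes become
# a 16-character initialization vector.
encryption_key = secrets.token_hex(16)
encryption_iv = secrets.token_hex(8)

print(len(encryption_key))  # 32
print(len(encryption_iv))   # 16
```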
Both variables must be created in an Azure DevOps library. They are used during the Release pipeline by the script that configures the Databricks cluster, which adds the EncryptionKey and EncryptionInitializationVector variables as Databricks secrets.
Metadata database configuration¶
The information about what Assets, Entities and Attributes need to be encrypted is configured in the Sidra Core metadata database, along with the rest of metadata for those elements.
The Asset table contains a field IsEncrypted that specifies whether the Asset is encrypted. This field is managed by the platform and is not meant to be configured by users. The IsEncrypted field is updated by the data ingestion pipeline once the encryption of the raw copy of the Asset is completed. Additionally, the same field is used to determine whether the raw copy must be decrypted when reloading the file -i.e. re-ingesting the file- or extracting it.
The Entity table contains a field AdditionalProperties, which stores a JSON structure used for adding additional information about the Entity. This field can be used to configure encryption for the Assets associated with the Entity, by including the corresponding field in the JSON structure.
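The exact key expected inside AdditionalProperties is defined by the Sidra metadata model and is not reproduced here; as a purely illustrative sketch (the isEncrypted name below is a hypothetical placeholder):

```python
import json

# Hypothetical encryption flag inside AdditionalProperties; the real key
# name is defined by the Sidra metadata model.
additional_properties = {"isEncrypted": True}

# This JSON string is what would be stored in the
# Entity.AdditionalProperties column.
print(json.dumps(additional_properties))
```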
This information is used by the data ingestion pipeline to know if an Asset must be encrypted.
The Attribute table contains a field IsEncrypted that users can set to identify which Attributes of the Entity associated with the Asset must be encrypted in Databricks.
The Entity that requires encryption must be associated with an encryption ingestion pipeline. This means that the EntityPipeline table must contain the association between the Entity and the FileIngestionWithEncryption pipeline, instead of the usual file ingestion pipeline.
How it works¶
The encryption is performed by the FileIngestionWithEncryption pipeline, which is launched after the raw copy of the Asset is stored in the Data Storage Unit (DSU). At this point the raw copy is not encrypted yet. More information about where that raw copy comes from can be found in the Overview section.
The encryption data ingestion pipeline follows these steps:
Use transfer query to load the Asset in Databricks¶
The plain -not encrypted- raw copy is used to load the information into Databricks using the transfer query script. More information about how the transfer query works can be found in this section.
The transfer query script is auto-generated specifically for the Entity based on its metadata information.
The transfer query is generated by the GenerateTransferQuery activity in the ADF pipeline, which checks the IsEncrypted field of the Attributes to generate a Spark SQL query that loads the data into the DSU. This Spark SQL query loads those attributes into Databricks already encrypted.
After the generation of the transfer query script, this script is executed using a Notebook activity in the pipeline.
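As a rough sketch of the idea (the attribute names and the encrypt() helper below are hypothetical; the real script is auto-generated from the actual Entity metadata):

```python
# Hypothetical Attribute metadata, as the generator might read it from the
# Sidra Core metadata database.
attributes = [
    {"Name": "Id", "IsEncrypted": False},
    {"Name": "Email", "IsEncrypted": True},
    {"Name": "Country", "IsEncrypted": False},
]

def build_transfer_query(source_table, attributes):
    """Build a Spark SQL SELECT where only the flagged attributes are
    wrapped in a (hypothetical) encrypt() expression."""
    columns = []
    for attribute in attributes:
        name = attribute["Name"]
        if attribute["IsEncrypted"]:
            columns.append(f"encrypt({name}) AS {name}")
        else:
            columns.append(name)
    return f"SELECT {', '.join(columns)} FROM {source_table}"

print(build_transfer_query("staging_asset", attributes))
# SELECT Id, encrypt(Email) AS Email, Country FROM staging_asset
```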
Encrypt plain raw copy¶
While the transfer query script is generated and executed, an encrypted version of the raw copy of the Asset is created. To encrypt the raw copy, a Python Notebook activity is used, which in turn executes the FileEncryption.py notebook included in Sidra Core as part of the Sidra deployment. This Python file is located in the Shared folder inside the Databricks workspace. The FileEncryption.py notebook uses the Databricks secrets configured by the DevOps Release pipeline.
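The cipher details of FileEncryption.py are not documented here; as a sketch of the general approach, assuming AES in CBC mode with PKCS7 padding and the third-party cryptography package:

```python
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def encrypt_bytes(plain: bytes, key: bytes, iv: bytes) -> bytes:
    """Encrypt a byte string with AES-CBC, padding it to the block size."""
    padder = padding.PKCS7(algorithms.AES.block_size).padder()
    padded = padder.update(plain) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return encryptor.update(padded) + encryptor.finalize()

# Stand-ins for the EncryptionKey (32 bytes) and
# EncryptionInitializationVector (16 bytes) Databricks secrets.
key = b"0" * 32
iv = b"1" * 16
ciphertext = encrypt_bytes(b"raw copy contents", key, iv)
```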
Clean up after the process finishes¶
Once the load into Databricks and the creation of the encrypted raw copy are complete, the following actions are performed:
- The plain raw copy is deleted.
- The encrypted raw copy is renamed with the original filename of the plain raw copy.
- The Asset metadata is marked as encrypted by activating the IsEncrypted flag.
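Under local-filesystem assumptions (the real pipeline operates on Azure Storage and the metadata database, not a local disk), these clean-up steps can be sketched as:

```python
import pathlib
import tempfile

# Simulate the two raw copies produced by the pipeline.
workdir = pathlib.Path(tempfile.mkdtemp())
plain_copy = workdir / "asset.csv"
encrypted_copy = workdir / "asset.csv.encrypted"
plain_copy.write_bytes(b"plain raw copy")
encrypted_copy.write_bytes(b"encrypted raw copy")

plain_copy.unlink()                           # 1. delete the plain raw copy
encrypted_copy.rename(workdir / "asset.csv")  # 2. rename to the original filename
asset_is_encrypted = True                     # 3. activate the IsEncrypted flag
```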