How encrypted data ingestion works¶
Versions
- This legacy encryption process was kept for backwards compatibility; however, with native encryption now supported in Databricks, this encryption feature is no longer supported in Sidra. A new feature leveraging the native mechanism may be incorporated into the roadmap.
- From version 2021.R1, an additional stage ensures that the unencrypted copy is never stored in the same location as the encrypted raw copy: the unencrypted file is first moved to a temporary location and, once encrypted, it is moved to the raw zone.
Data ingestion in Sidra produces two copies of the information:
- The raw copy of the Asset that is stored in an Azure Storage container.
- The optimized storage of the information that is loaded into Databricks.
When it comes to the encryption of information:
- The raw copy is fully encrypted: the entire file is encrypted and stored in the Azure Storage container.
- The information loaded into Databricks is partially encrypted, that is, only those Attributes marked with the `IsEncrypted` flag are encrypted.
How encryption is configured in the platform¶
The information used in the encryption process is provided to the platform via two mechanisms:
- In the Sidra Core deployment: Some parameters need to be configured in the Azure DevOps release pipeline in order to use encryption.
- In the metadata database: Additional information must be configured in some of its tables to identify the content that must be encrypted.
Both mechanisms are detailed below:
Azure DevOps configuration¶
In order to enable any kind of encryption it is necessary to create and populate the following variables:
- `EncryptionKey`: a randomly generated string, 16, 24 or 32 bytes long.
- `EncryptionInitializationVector`: a randomly generated string, 16 bytes long.
Both variables must be created inside an Azure DevOps library.
These variables will be used during the Release pipeline by the script that configures Databricks, which will add them as the Databricks secrets `key` and `initialization_vector`, respectively.
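A minimal sketch for generating suitable values is shown below; any cryptographically secure random source producing strings of the required lengths will do. The resulting strings are stored as the two Azure DevOps variables and end up in Databricks as the secrets `key` and `initialization_vector`.

```python
# Sketch only: generate a 32-byte key and a 16-byte initialization vector as hex strings.
import secrets

encryption_key = secrets.token_hex(16)        # 32-character string (32 bytes, AES-256)
initialization_vector = secrets.token_hex(8)  # 16-character string (16 bytes)

print("EncryptionKey:", encryption_key)
print("EncryptionInitializationVector:", initialization_vector)
```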
Metadata configuration¶
The information about what Assets, Entities and Attributes are encrypted is included in the metadata database along with the rest of metadata for those elements.
Entities¶
The `Entity` table contains a field `AdditionalProperties`, which stores a JSON structure used for adding additional information about the Entity.
Users can use this field to configure the encryption settings for the Assets associated with the Entity. The flag for enabling encryption is conveyed through a dedicated property in this JSON. This information is used by the data ingestion pipeline to determine whether an Asset must be encrypted.
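As an illustration only, the `AdditionalProperties` value could look like the snippet below. The property name `isEncrypted` is an assumption here; check the metadata reference of your Sidra version for the exact name.

```python
# Illustrative only: the property name "isEncrypted" is an assumption.
import json

additional_properties = {"isEncrypted": True}

# Value to store in Entity.AdditionalProperties
print(json.dumps(additional_properties))  # {"isEncrypted": true}
```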
Attributes¶
The `Attribute` table contains a field `IsEncrypted`, which users can use to identify which Attributes must be encrypted in Databricks.
This information is used in the legacy transfer query / DSU ingestion to encrypt those Attributes.
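A hedged sketch of how an Attribute could be flagged directly in the metadata database follows; the connection string placeholders, the `[DataIngestion]` schema name and the Attribute Id are assumptions and may differ in your environment.

```python
# Sketch only: flag one Attribute for encryption in the metadata database.
# The connection string and the [DataIngestion] schema are assumptions.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<core-sql-server>;Database=<core-metadata-database>;"
    "Uid=<user>;Pwd=<password>"
)
cursor = conn.cursor()
cursor.execute(
    "UPDATE [DataIngestion].[Attribute] SET IsEncrypted = 1 WHERE Id = ?",
    (42,),  # Id of the Attribute to encrypt (illustrative)
)
conn.commit()
conn.close()
```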
Encryption pipeline¶
An Entity that requires encryption must be associated with an encryption ingestion pipeline, that is, the `EntityPipeline` table must contain the association between the Entity and the `FileIngestionWithEncryption` pipeline instead of the usual `FileIngestionDatabricks`.
How it works¶
The pre-existing `FileIngestionWithEncryption` pipeline is removed from this version in favor of using the `FileIngestionDatabricks` pipeline.
The process involves the following steps:
- Step 1: Move the file to a temporary zone, encrypt it, and move it to the raw zone
The `RegisterAsset` pipeline moves the file to a temporary zone and, only once the file is encrypted, moves it to the raw zone.
To encrypt the raw copy, a Python notebook activity is used, which executes the `FileEncryption.py` notebook included in Sidra and deployed in the `Shared` folder in Databricks. This notebook uses the Databricks secrets configured by the DevOps Release pipeline; a sketch of this kind of whole-file encryption is included after these steps.
- Step 2: Use DSU ingestion to load the Asset into Databricks
After this encryption has executed (it is required before the file can be moved to the raw zone), the next stage of the ingestion process is the execution of the autogenerated transfer query / DSU ingestion, which copies the data to the DSU. The transfer query / DSU ingestion first decrypts the file in a temporary zone before ingesting the data into the DSU.
The `ExtractData` pipeline calls the DSU ingestion script, which reads the raw copy of the Asset and inserts the information into the tables created by the `CreateTables.py` script. It also checks the `IsEncrypted` field of the Attributes to generate a Spark SQL query that loads those Attributes encrypted into Databricks; the second sketch below illustrates the idea.
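The first sketch below approximates the kind of whole-file encryption performed in Step 1 by a notebook such as `FileEncryption.py`. AES-CBC, the secret scope name `encryption` and the file paths are assumptions made for illustration; the actual notebook shipped with Sidra may differ. Note that `dbutils` is only available inside Databricks notebooks.

```python
# Sketch only: encrypt the raw copy of an Asset with the key and IV stored as
# Databricks secrets. AES-CBC, the scope name and the paths are assumptions.
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.padding import PKCS7

key = dbutils.secrets.get("encryption", "key").encode()                    # Databricks secret
iv = dbutils.secrets.get("encryption", "initialization_vector").encode()   # Databricks secret

source_path = "/dbfs/mnt/temporary/asset.csv"  # unencrypted temporary copy (illustrative)
target_path = "/dbfs/mnt/raw/asset.csv"        # encrypted raw copy (illustrative)

with open(source_path, "rb") as source:
    data = source.read()

# Pad the content to the AES block size and encrypt the whole file.
padder = PKCS7(128).padder()
padded = padder.update(data) + padder.finalize()
encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()

with open(target_path, "wb") as target:
    target.write(encryptor.update(padded) + encryptor.finalize())
```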
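The second sketch approximates, in PySpark, what the autogenerated transfer query does in Step 2: it re-encrypts only the columns that correspond to Attributes flagged with `IsEncrypted`. The column names, staging path, target table and cipher mode are assumptions; the real query is autogenerated by Sidra.

```python
# Sketch only: load a decrypted temporary copy and re-encrypt just the flagged columns.
from base64 import b64encode
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.padding import PKCS7
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
key = dbutils.secrets.get("encryption", "key").encode()                   # Databricks secret
iv = dbutils.secrets.get("encryption", "initialization_vector").encode()  # Databricks secret

def encrypt_value(value):
    """AES-CBC encrypt a single column value and return it Base64-encoded."""
    if value is None:
        return None
    padder = PKCS7(128).padder()
    padded = padder.update(str(value).encode()) + padder.finalize()
    encryptor = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return b64encode(encryptor.update(padded) + encryptor.finalize()).decode()

encrypt_udf = F.udf(encrypt_value, T.StringType())

# Attributes flagged with IsEncrypted in the metadata database (illustrative names).
encrypted_attributes = {"Email", "PhoneNumber"}

df = spark.read.parquet("/mnt/staging/decrypted_asset")  # decrypted temporary copy
projection = [
    encrypt_udf(F.col(c)).alias(c) if c in encrypted_attributes else F.col(c)
    for c in df.columns
]
df.select(*projection).write.mode("append").saveAsTable("dsu.my_entity")
```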