Understand Sidra data lake approach

A detailed introduction to data lakes as a modern data repository is out of scope for this document. The main idea is to avoid some of the common pitfalls of data products that must integrate with many data sources, by relaxing the rules of classical data warehouses.

In contrast to traditional data warehouses, data lakes store information in the rawest format possible, whether the data is structured or unstructured. This simplifies the data ingestion logic, shifting the paradigm from ETL to ELT, and puts the focus on how each client application uses the data, since each application handles its own application-specific transformations and validations. This approach provides the following benefits:

The data needs to be loaded only once, in raw format, into the system
Each client application might decide to transform this data according to specific business rules or validations, but this operation will happen inside the system, thus limiting the surface area that needs to be dealt with.

Protection against changes to the provider schemas
Even though most, if not all, data within the system will be structured, the schema-on-read approach still brings benefits. Most notably, it helps protect the system against changes to provider schemas.
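As an illustration of schema-on-read, the following minimal sketch applies a schema only when the raw data is read, so a provider adding or reordering columns does not break the reader. The `ORDER_SCHEMA` mapping, the `read_with_schema` helper and the sample CSV content are hypothetical names chosen for this example and are not part of Sidra.

```python
import csv
import io

# Hypothetical read-time schema: column name -> converter function.
ORDER_SCHEMA = {"order_id": int, "amount": float}


def read_with_schema(raw_text, schema):
    """Apply a schema while reading raw CSV text; unknown columns are ignored."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        row = {}
        for name, convert in schema.items():
            value = record.get(name)
            # Missing or empty values become None instead of failing the read.
            row[name] = convert(value) if value not in (None, "") else None
        rows.append(row)
    return rows


# Raw file as originally delivered by the provider.
raw_v1 = "order_id,amount\n1,9.50\n2,20.00\n"
# The provider later adds a column and reorders the existing ones.
raw_v2 = "currency,amount,order_id\nEUR,9.50,1\nEUR,20.00,2\n"

rows_v1 = read_with_schema(raw_v1, ORDER_SCHEMA)
rows_v2 = read_with_schema(raw_v2, ORDER_SCHEMA)
```

Both versions of the raw file yield the same rows, because the stored data is left untouched and the schema is resolved by column name at read time.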

Security and audit benefits
Because all raw data is kept in the same storage mechanisms, security is easier to control with the technologies currently available.

Reduction in storage costs
Data lake approaches are usually based on clustered or cloud file systems. These allow the use of much cheaper storage than their traditional relational counterparts, which makes it possible to store a huge historical data set in raw format where this would otherwise have been economically impractical.

Data lake architecture

The diagram below provides a high-level overview of the architecture:

data-lake-architecture

As the diagram shows, multiple clients use the common storage service to fetch the data sets they need for their operations. Each client has its own storage, local to the application, which is synchronized from the main repository and can use a different set of storage and query technologies. In addition, a set of common services, such as security and auditing, is provided across the whole system.

Data storage (Azure Data Lake Storage)

Azure Data Lake Storage Gen2 (ADLS Gen2) is where all the data for every provider added to the system is stored. Any client application can then request the data it needs from the data lake and apply the necessary transformations based on its own business logic.

All the data is stored in an optimized format, ORC by default, with support for Parquet and Delta coming soon. This allows the data to be stored in a compressed, well-partitioned and indexed structure that supports predicate and aggregation push-down to increase performance.
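To give an intuition of why predicate push-down improves performance, the sketch below mimics, in plain Python, how a columnar format can keep min/max statistics per chunk of rows and skip chunks that cannot match a filter. It is a simplified model and not how ORC is implemented; the `Stripe` class and the `make_stripes` and `scan_greater_than` helpers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """A chunk of rows plus the min/max statistics stored alongside it."""
    min_value: int
    max_value: int
    rows: list


def make_stripes(values, stripe_size):
    """Split sorted values into stripes and record min/max for each one."""
    ordered = sorted(values)
    stripes = []
    for start in range(0, len(ordered), stripe_size):
        chunk = ordered[start:start + stripe_size]
        stripes.append(Stripe(chunk[0], chunk[-1], chunk))
    return stripes


def scan_greater_than(stripes, threshold):
    """Return the matching rows and the number of stripes actually read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # Statistics prove no row can match, so skip the stripe.
        stripes_read += 1
        matches.extend(v for v in stripe.rows if v > threshold)
    return matches, stripes_read


stripes = make_stripes(range(100), stripe_size=10)
matches, stripes_read = scan_greater_than(stripes, threshold=84)
```

In this toy data set, only the stripes whose statistics overlap the filter are read, so most of the stored data is never touched by the query.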

Data movement (Azure Data Factory)

The data lake requires several ETL/ELT processes to load data from a variety of data sources. Standardizing the way data is ingested into the platform can reduce the development effort needed to build new ETLs.

In addition, internal data movement is needed for scenarios such as:

  • loading data from the raw storage into the data lake store
  • exporting data from the data lake into an application data mart

Those scenarios are basic processes and can be implemented with a very limited feature set, so the most sensible approach is to use the same technology for both cases, if possible.

The selected technology for this kind of task is Azure Data Factory V2 (ADF). Azure Data Factory is a globally deployed data movement service in the cloud, which orchestrates and automates the movement and transformation of data both in the cloud and on-premises.
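The idea of covering both internal movements with a single mechanism can be sketched as one parameterized copy step that is reused with different sources and destinations. The example below uses local folders as stand-ins for the storage areas and does not call Azure Data Factory; the `copy_dataset` helper, the folder layout and the sample file are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_dataset(source: Path, destination: Path) -> int:
    """Copy all files under source into destination; return the file count."""
    copied = 0
    for path in sorted(source.rglob("*")):
        if path.is_file():
            target = destination / path.relative_to(source)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied += 1
    return copied


# Stand-in folders for the three storage areas (hypothetical layout).
root = Path(tempfile.mkdtemp())
raw, lake, mart = root / "raw", root / "lake", root / "mart"
(raw / "provider_a").mkdir(parents=True)
(raw / "provider_a" / "orders.csv").write_text("order_id,amount\n1,9.50\n")

loaded = copy_dataset(raw, lake)     # raw storage -> data lake store
exported = copy_dataset(lake, mart)  # data lake -> application data mart
```

The same step serves both scenarios because only its parameters change, which is the property that makes a single movement technology sufficient.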

Multiple data lake regions

Sidra supports one or several data lakes in different regions, each with the following:

  • ingestion zone
  • raw storage
  • optimized storage

data-lake-multiple-regions

All the Data Lakes will share the Core common services provided by Sidra. The Sidra Core will contain information about each Data Lake and its configuration, e.g., Azure Storage accounts used and the purpose for each one. This information allows Core to orchestrate the process involving any of the Data Lakes.

Each Data Lake can reside in a different Azure region, and those regions can differ from the one in which Core resides. This regional distribution provides the versatility needed to support business requirements, e.g., when the storage for certain data must reside in a specific country.

The client apps can be configured to access one or several Data Lakes.
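The relationship between Core, the Data Lakes and the client apps can be pictured with a small data model. This is a hedged sketch, not Sidra's actual metadata schema: the `DataLake` and `Core` classes, their methods, the Data Lake names and the storage account names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataLake:
    name: str
    region: str
    storage_accounts: dict  # purpose -> storage account name (hypothetical)


@dataclass
class Core:
    region: str
    data_lakes: dict = field(default_factory=dict)

    def register(self, lake: DataLake) -> None:
        """Record a Data Lake and its configuration in Core."""
        self.data_lakes[lake.name] = lake

    def lakes_in_region(self, region: str) -> list:
        """Find the Data Lakes that satisfy a regional requirement."""
        return [l for l in self.data_lakes.values() if l.region == region]

    def resolve(self, lake_names) -> list:
        """Return the Data Lakes a client app is configured to access."""
        return [self.data_lakes[name] for name in lake_names]


core = Core(region="westeurope")
core.register(DataLake("lake-eu", "westeurope",
                       {"ingestion": "euingest", "raw": "euraw", "optimized": "euopt"}))
core.register(DataLake("lake-de", "germanywestcentral",
                       {"ingestion": "deingest", "raw": "deraw", "optimized": "deopt"}))

# A client app configured to access both Data Lakes.
client_app_lakes = core.resolve(["lake-eu", "lake-de"])
# Data that must stay in Germany would be placed in this Data Lake.
german_lakes = core.lakes_in_region("germanywestcentral")
```

Keeping the region and the purpose of each storage account in one place is what would let a central service orchestrate processes across Data Lakes that live in different regions.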