
Understand Sidra data lake approach

A detailed introduction to the data lake as a modern data repository is outside the scope of this document. The main idea is to relax the rules of classical data warehouses in order to avoid some common pitfalls of data products that need to integrate with many data sources.

In contrast to traditional data warehouses, data lakes store information in the purest, rawest format possible (the concept of an immutable data lake), whether the data is structured or unstructured. This simplifies the data ingestion logic, shifting the paradigm from ETL to ELT, and moves the focus to how each Data Product uses the data: each Data Product handles its own application-specific transformations and validations.

Sidra data lake approach provides the following benefits:

The data needs to be loaded into the system only once, in raw format
Each Data Product can decide to transform the data in the data lake according to its own business rules or validations. This operation happens inside the Sidra system, fully integrated with the overall Sidra services and metadata model, which limits the surface area that needs to be dealt with.

Protection against changes to the provider schemas
Even though most, if not all, of the data within the system will be structured, the schema-on-read approach still brings benefits. Most notably, it helps protect the system against changes to provider schemas.

Security and audit benefits
Because all raw data uses the same storage mechanisms, and given the technologies currently available, security is easier to control.

Reduction in storage costs
Data lake approaches are usually based on clustered or cloud file systems. These allow the use of much cheaper storage than their traditional relational counterparts, which makes it possible to store a huge historical data set in raw format when doing so would otherwise have been economically impractical.
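
The schema-on-read benefit listed above can be illustrated with a minimal sketch. It is not Sidra code: the raw extracts, the column names and the `read_with_schema` helper are all hypothetical, and only the Python standard library is used. The schema is applied when the data is read, so a provider that adds or reorders columns does not break the consumer.

```python
import csv
import io

# Raw provider extracts, kept exactly as received (hypothetical data).
# The second extract reflects a provider-side schema change:
# the columns are reordered and a new "segment" column appears.
RAW_V1 = "id,name,country\n1,Alice,ES\n2,Bob,UK\n"
RAW_V2 = "id,segment,country,name\n3,retail,FR,Carol\n"

# Hypothetical read-time schema owned by a single Data Product.
PRODUCT_SCHEMA = {"id": int, "name": str, "country": str}


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; the raw text is never modified."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        # Columns are selected by name, so extra or reordered
        # columns in the raw data are simply ignored.
        rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows


customers = (
    read_with_schema(RAW_V1, PRODUCT_SCHEMA)
    + read_with_schema(RAW_V2, PRODUCT_SCHEMA)
)
```

This sketch only covers additive and reordering changes. If a provider removed a column that the schema requires, `record[col]` would raise a `KeyError`, and the Data Product would need to handle that case explicitly.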

Data lake architecture

The diagram below provides a high-level overview of Sidra data lake architecture approach:

[Figure: data-lake-architecture]

As the diagram shows, multiple clients use the common storage service to fetch the data sets they need for their operations. Each client has its own storage, local to the application and synchronized from the main repository, which can use a different set of storage and query technologies. In addition, a set of common services, such as security and auditing, is provided across the whole system.

Data storage (Azure Data Lake Storage)

Azure Data Lake Storage Gen2 (ADLS Gen2) is where all the data for every Provider added to the system is stored. Any Data Product can then request the data it needs from the data lake and apply the necessary transformations based on its own business logic. Data Products thus behave as groups of resources that communicate with the Sidra data lake to transparently keep the data synchronized and available for data exploitation use cases.

All the data is stored in an optimized Databricks Delta format. This allows the data to be stored in a compressed, well-partitioned and indexed structure, with support for predicate and aggregation pushdown to improve performance.
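
To show why a partitioned layout combined with predicate pushdown improves performance, here is a minimal pure-Python sketch. It does not use the Delta or Databricks APIs; the records, the `load_date` partition column and both helper functions are hypothetical. The point is only that a filter on the partition column lets the reader skip whole partitions instead of scanning every row.

```python
from collections import defaultdict

# Hypothetical records; "load_date" plays the role of the partition column.
RECORDS = [
    {"load_date": "2023-07-05", "amount": 10},
    {"load_date": "2023-07-06", "amount": 20},
    {"load_date": "2023-07-06", "amount": 5},
    {"load_date": "2023-07-07", "amount": 40},
]


def write_partitioned(records, key):
    """Group records by partition value, like one folder per value."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)


def scan(partitions, partition_value=None):
    """Return the matching rows and the number of partitions read."""
    if partition_value is None:
        selected = list(partitions)  # no filter: every partition is read
    else:
        # The filter is applied at the storage layout level, so
        # non-matching partitions are never opened.
        selected = [p for p in partitions if p == partition_value]
    rows = [row for p in selected for row in partitions[p]]
    return rows, len(selected)


table = write_partitioned(RECORDS, "load_date")
rows, partitions_read = scan(table, partition_value="2023-07-06")
total = sum(row["amount"] for row in rows)
```

Here the filtered query reads one partition out of three. A real engine applies the same idea using file-level metadata rather than an in-memory dictionary.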

Data movement (Azure Data Factory)

The data lake requires several ETL/ELT processes to load data from a variety of data sources. Standardizing the way data is ingested into the platform reduces the development effort needed to build new ETLs.

In addition, internal data movement is needed for scenarios such as:

  • loading data from the raw storage into the data lake store
  • exporting data from the data lake into an application data mart

These scenarios are basic processes that can be implemented with a very limited feature set, so the most sensible approach is to use the same technology for both cases, where possible.

The selected technology for this kind of task is Azure Data Factory V2 (ADF). Azure Data Factory is a globally deployed data movement service in the cloud, which orchestrates and automates the movement and transformation of data both in the cloud and on-premises.
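
As a rough illustration of using one technology for both internal movements, the sketch below builds two pipeline definitions from a single template, following the general layout of an ADF pipeline with one Copy activity. It is not how Sidra generates its pipelines: the `copy_pipeline` helper and all pipeline and dataset names are hypothetical, and the source and sink types (including a Parquet sink standing in for the optimized storage) are assumptions chosen for the example.

```python
import json


def copy_pipeline(name, source_dataset, sink_dataset, source_type, sink_type):
    """Build a pipeline definition containing a single Copy activity."""
    return {
        "name": name,
        "properties": {
            "activities": [
                {
                    "name": f"Copy_{source_dataset}_to_{sink_dataset}",
                    "type": "Copy",
                    "inputs": [
                        {"referenceName": source_dataset, "type": "DatasetReference"}
                    ],
                    "outputs": [
                        {"referenceName": sink_dataset, "type": "DatasetReference"}
                    ],
                    "typeProperties": {
                        "source": {"type": source_type},
                        "sink": {"type": sink_type},
                    },
                }
            ]
        },
    }


# Both internal movements reuse the same template (all names are hypothetical).
raw_to_lake = copy_pipeline(
    "LoadRawToLake", "RawFile", "LakeTable", "DelimitedTextSource", "ParquetSink"
)
lake_to_mart = copy_pipeline(
    "ExportLakeToMart", "LakeTable", "MartTable", "ParquetSource", "AzureSqlSink"
)

print(json.dumps(lake_to_mart, indent=2))
```

The resulting dictionaries are plain JSON-serializable structures. Before deploying anything like this, the dataset references and type names should be checked against the linked services and datasets that actually exist in the target Data Factory.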

Multiple data lake regions

Sidra supports one or several Data Storage Units (DSUs). Each DSU is essentially a resource group, integrated with the Sidra Service, that includes all the services related to data intake and data processing in Sidra.

Each DSU is independently deployable, and each one can be installed in a different region if required. Each DSU contains the following parts:

  • An ingestion zone (or landing zone)
  • A raw storage
  • An optimized storage (ADLS Gen2 storage)

[Figure: data-lake-multiple-regions]

All DSUs share the common services provided by the Sidra Service. The Sidra Service holds information about each DSU and its configuration, e.g., the Azure Storage accounts used and the purpose of each one. This information allows the Sidra Service to orchestrate processes involving any of the DSUs.

Each DSU can reside in a different Azure region, which does not have to be the region in which the Sidra Service resides. This regional distribution provides the versatility needed to support business requirements, e.g., a requirement that certain data must be stored in a specific country.

The Data Products can be configured to access the data stored in one or several DSUs.
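
The relationship between the Sidra Service, DSUs and Data Products can be sketched as a small in-memory model. This is not the Sidra metadata model: every class, field, DSU name and region below is hypothetical, and one storage account per DSU part is an assumption made for the example. It only shows the kind of information described above (DSU region, storage accounts and their purpose) and how a Data Product might reference one or several DSUs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StorageAccount:
    name: str
    purpose: str  # "landing", "raw" or "optimized"


@dataclass
class DataStorageUnit:
    name: str
    region: str
    storage_accounts: list = field(default_factory=list)


@dataclass
class DataProduct:
    name: str
    dsu_names: list  # DSUs this Data Product is configured to access


def build_dsu(name, region):
    """Create a DSU with one account per part (assumption for this sketch)."""
    accounts = [
        StorageAccount(f"{name}{purpose}", purpose)
        for purpose in ("landing", "raw", "optimized")
    ]
    return DataStorageUnit(name, region, accounts)


# Central registry, standing in for what the Sidra Service knows about DSUs.
REGISTRY = {
    dsu.name: dsu
    for dsu in (build_dsu("dsueu", "westeurope"), build_dsu("dsuus", "eastus"))
}


def regions_for(product, registry):
    """List the regions a Data Product reads from, based on its DSUs."""
    return sorted({registry[name].region for name in product.dsu_names})


sales = DataProduct("sales", ["dsueu", "dsuus"])
hr = DataProduct("hr", ["dsueu"])
```

With this model, `regions_for(hr, REGISTRY)` shows that the `hr` Data Product only touches data stored in one region, which is the kind of check a data residency requirement would call for.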


Last update: 2023-07-07