Sidra Data Platform benefits and advantages¶
This document summarizes the main advantages and benefits of Sidra Data Platform. This is a simplified diagram of Sidra architecture to better understand each of the features explained below:
Automation of ingestion and data movement processes¶
The default interface for creating ADF pipelines is a visual designer, which is very useful for experimentation and prototyping. For production systems, however, the cost of manually creating and testing every pipeline quickly adds up. What if the system loads data from 50 tables, and each of them is modified to include a new column? That means modifying and testing 50 different pipelines, 50 different transfer queries and 50 different tables, just to make those changes available for querying in the Data Lake.
Sidra takes a different approach. Instead of relying on manual creation of pipelines through a visual designer, it automates the ingestion and data movement processes based on metadata. In the previous scenario, the only modification required would be adding the new column to the Entity table in the Azure SQL Database CoreDB; the automated processes then update everything else (including the ADF pipelines, the transfer queries and the Azure Data Lake tables). This not only simplifies adding new data sources or modifying existing ones, but also greatly reduces the time it takes compared with manual configuration, simplifies testing and reduces the room for errors. One of the companies that has implemented Sidra used to need up to one week to configure a new data source, a task that can be accomplished in less than a day with Sidra.
To simplify this process even further, Sidra offers a web interface for metadata management that allows easy creation, querying and editing of the metadata configured for all the data sources in the platform.
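The metadata-driven idea described above can be illustrated with a minimal sketch. It is not Sidra's implementation: the `Entity` and `Attribute` classes, the table names and the generated SQL are all hypothetical, chosen only to show how a transfer query and a table definition can be derived from a single metadata record, so that adding a column means changing the metadata and nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Hypothetical description of one column of a source table."""
    name: str
    sql_type: str


@dataclass
class Entity:
    """Hypothetical metadata record: one source table and its columns."""
    source_table: str
    lake_table: str
    attributes: list[Attribute] = field(default_factory=list)


def transfer_query(entity: Entity) -> str:
    """Build the extraction query from the declared columns."""
    columns = ", ".join(a.name for a in entity.attributes)
    return f"SELECT {columns} FROM {entity.source_table}"


def lake_table_ddl(entity: Entity) -> str:
    """Build the destination table definition from the same metadata."""
    columns = ", ".join(f"{a.name} {a.sql_type}" for a in entity.attributes)
    return f"CREATE TABLE IF NOT EXISTS {entity.lake_table} ({columns})"


orders = Entity(
    source_table="dbo.Orders",
    lake_table="staging.orders",
    attributes=[Attribute("Id", "INT"), Attribute("Total", "DECIMAL(10,2)")],
)

# Adding a column is a metadata change only; both artifacts are regenerated.
orders.attributes.append(Attribute("Currency", "STRING"))
print(transfer_query(orders))
print(lake_table_ddl(orders))
```

With 50 tables, the same two functions would be applied to 50 metadata records, instead of editing 50 queries and 50 table definitions by hand.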
A data platform in the cloud is complex, with many different moving parts. It becomes hard to stay on top of it and to monitor what is happening and what may have gone wrong. Even though Azure provides default tools for this purpose, such as the Azure Data Factory dashboards and logs, the information is usually scattered among them, which makes it hard to see the complete picture. To address this, Sidra offers three different tools focused on two different user profiles:
- An operational dashboard that allows easy diagnosis of the infrastructure status and the progress of the scheduled data loads, as well as of any issues that may have occurred.
- Integration of all logs generated by either Sidra or any Azure service into Azure Log Analytics, providing a unified platform to analyse and diagnose any issues during the normal operation of the platform.
- A Power BI operational report, generated from the operational data about data intake processes, pipelines, etc. This provides a dynamic view of operational processes, mostly data intake.
Audit and data lineage capabilities¶
One of the most requested features of a data platform is data lineage. Questions such as where a specific row of data in a report came from, or when a file was loaded, are critical in many business scenarios.
Instead of having to rely on manual exploration from within Azure, Sidra enriches all ingested data with metadata that allows easy tracking of the following details for each specific row:
- Which data source originated it.
- When it was loaded into the Data Lake.
- Which transformations were applied to it, and when they were applied.
This makes it easier to analyse the flow of data within the system and is also an invaluable source of information when diagnosing issues such as missing or malformed data, as checking the loaded data against the source is completely straightforward.
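As an illustration of row-level lineage, the sketch below attaches the three details listed above to each ingested row. The column names (`_source`, `_loaded_at`, `_transformations`) and both helper functions are assumptions made for this example; the columns Sidra actually adds may differ.

```python
from datetime import datetime, timezone


def enrich(rows, source_name, loaded_at=None):
    """Return copies of the rows with hypothetical lineage columns attached."""
    loaded_at = loaded_at or datetime.now(timezone.utc)
    return [
        {
            **row,
            "_source": source_name,              # which data source originated it
            "_loaded_at": loaded_at.isoformat(),  # when it was loaded
            "_transformations": [],               # filled in as the row is processed
        }
        for row in rows
    ]


def record_transformation(row, name, applied_at=None):
    """Return a copy of the row with one more entry in its transformation history."""
    applied_at = applied_at or datetime.now(timezone.utc)
    entry = {"name": name, "applied_at": applied_at.isoformat()}
    return {**row, "_transformations": row["_transformations"] + [entry]}


load_time = datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc)
rows = enrich([{"id": 1, "total": "10.50"}], "erp_orders", load_time)
row = record_transformation(rows[0], "cast_total_to_decimal", load_time)
print(row)
```

Because the lineage travels with the row, checking a suspicious value against its source only requires reading the row itself.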
Every project faces issues when managing infrastructure, and cloud projects are no different. Even if the cloud makes it easier to manage and scale the infrastructure requirements of a data platform, there are still hurdles to overcome:
- What happens when the environment needs to be recreated for any reason?
- What about deployments in another region?
- How much time and effort is required for production deployments?
- How to ensure that what was validated in a UAT environment is going to work properly when promoted to production?
Sidra’s Continuous Integration and Continuous Deployment (CI/CD) flows aim at solving all these issues and ensuring that the platform is fast to deploy and stable in all environments.
All Azure infrastructure is deployed through ARM templates, which are parametrized for each environment and run with minimal human interaction. In the development environment they are triggered automatically once changes are made, while in UAT and production they are triggered manually to avoid interfering with user activity. The only differences between environments are the parameters provided for each one, such as performance tiers, so the environments are otherwise identical and no problems arise during the deployment process.
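The following sketch shows one way to keep environments identical except for their parameters: a shared set of values plus a small set of per-environment overrides, rendered into the standard ARM deployment parameter file layout. The parameter names (`location`, `sqlSkuName`, `environment`) and their values are hypothetical and are not taken from Sidra's templates.

```python
import json

SCHEMA = "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#"

# Hypothetical values shared by every environment.
SHARED = {"location": "westeurope", "sqlSkuName": "S0"}

# Hypothetical per-environment overrides, e.g. performance tiers.
OVERRIDES = {
    "dev": {},
    "uat": {"sqlSkuName": "S1"},
    "prod": {"sqlSkuName": "S3"},
}


def parameter_file(environment):
    """Render the ARM parameter file content for one environment."""
    values = {**SHARED, **OVERRIDES[environment], "environment": environment}
    return {
        "$schema": SCHEMA,
        "contentVersion": "1.0.0.0",
        "parameters": {name: {"value": value} for name, value in values.items()},
    }


def differing_parameters(env_a, env_b):
    """List the parameter names whose values differ between two environments."""
    a = parameter_file(env_a)["parameters"]
    b = parameter_file(env_b)["parameters"]
    return sorted(name for name in a if a[name] != b[name])


print(json.dumps(parameter_file("prod"), indent=2))
print(differing_parameters("uat", "prod"))
```

Since the template itself never changes, a comparison such as `differing_parameters("uat", "prod")` makes every intended difference between two environments explicit and reviewable.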
However, infrastructure is not the end of a deployment, as all the processes need to be configured or modified. Pieces such as Azure Data Factory pipelines, or SQL Server Stored Procedures need to be part of the deployment to have a fully functional platform. If those were manually created, they would need to be manually recreated or copied into the new environment, and then validated. By leveraging Sidra automation processes, the same metadata used and validated in one environment can be promoted to another one, providing an assurance that the data processing will behave as expected.
Sidra offers other features that help simplify or accelerate some scenarios, as well as avoid common pitfalls that data platforms in Azure usually encounter.
Pre-packaged Client Applications¶
Since some business scenarios are common across different companies, Sidra comes with two Client Applications already prepared to make the data from the data lake available for exploitation:
- A Data Lab Client Application, which provides end users with data analysis capabilities through Databricks notebooks. This is suitable for exploratory data analysis (EDA) scenarios.
- A simple SQL Database Client Application, which provides a generic template for exposing a tabular model through the automatic creation of staging tables fed from a copy of the data in the data lake.
Additionally, Sidra incorporates foundations for other types of Client Application. These are not yet out-of-the-box templates, but with some minimal development dictated by specific customer business rules they can be turned into fully functional Client Applications for more complex business use cases. For example, Sidra incorporates a foundation for a Data Quality application, which is designed to perform all the data cleansing and processing operations required for data that will be fed back to the data lake and made available for further consumption by other applications.
Several APIs are exposed by Sidra to enable the integration of third-party tools in areas such as Data Catalog or Data Querying.
Real time loading¶
For those scenarios where real time consumption of the data is required, Sidra is prepared to apply all the benefits and advantages previously mentioned by using Databricks Delta as part of a lambda architecture. This ensures data arrives as fast as possible to the real time consumers, while retaining the data ingestion through the lake for later applications.
ML/AI model serving¶
If machine learning is one of the intended uses of the data lake platform, Sidra integrates with MLflow to provide a model serving platform integrated with the data loading procedures. This platform is also used internally to power Sidra’s own anomaly detection and NLP models.
Multiple regions support¶
Sidra makes it easy to deploy separate data lake instances in different regions, to meet business or compliance requirements. Each instance leverages the same data catalog and metadata system, while having its own separate security rules.
Cost attribution mechanisms¶
During the deployment, all Azure resources are tagged and each application on top of the data lake runs in its own Resource Group, giving the business the ability to control and report the costs of each one individually and simplifying cost assignment if required.
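To show why tags and dedicated Resource Groups simplify cost assignment, the sketch below totals a handful of cost records by either dimension. The record layout, the resource group names and the `app` tag are invented for this example and do not reflect the format of any Azure cost export.

```python
from collections import defaultdict


def cost_by(records, key):
    """Sum the cost of the records, grouped by the value returned by key."""
    totals = defaultdict(float)
    for record in records:
        totals[key(record)] += record["cost"]
    return dict(totals)


# Hypothetical cost records, one per billed resource.
records = [
    {"resource_group": "rg-core", "tags": {"app": "core"}, "cost": 120.0},
    {"resource_group": "rg-datalab", "tags": {"app": "datalab"}, "cost": 45.5},
    {"resource_group": "rg-datalab", "tags": {"app": "datalab"}, "cost": 4.5},
]

per_group = cost_by(records, lambda r: r["resource_group"])
per_app = cost_by(records, lambda r: r["tags"]["app"])
print(per_group)
print(per_app)
```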
Anomaly detection mechanism¶
One of the most time-consuming maintenance activities is ensuring the data has been loaded properly into the system. Even a data flow that completes successfully can hide an issue: maybe the file was empty, or maybe it loaded ten times the usual data volume. Sidra's anomaly detection systems can help diagnose these issues automatically, resulting in a more robust system with less effort.
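The two examples above (an empty file and a load ten times larger than usual) can be caught even by a very simple rule. The sketch below is only a rule-based illustration, not the models Sidra uses: the `check_load` function, its labels and the `max_ratio` threshold are assumptions for this example.

```python
from statistics import median


def check_load(row_count, history, max_ratio=5.0):
    """Classify a load by comparing its row count with previous loads.

    history holds the row counts of earlier loads of the same data source;
    max_ratio is a hypothetical threshold for flagging a volume spike.
    """
    if row_count == 0:
        return "empty"
    if not history:
        return "ok"  # nothing to compare against yet
    typical = median(history)
    if typical > 0 and row_count / typical >= max_ratio:
        return "volume_spike"
    return "ok"


previous_loads = [1000, 1100, 950]
print(check_load(0, previous_loads))
print(check_load(10000, previous_loads))
print(check_load(1050, previous_loads))
```

The median is used as the reference volume so that a single unusual load in the history does not distort the comparison.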
Internal data warehouse¶
Sidra has its own internal data warehouse that feeds the operational dashboard, and can be used to support the development of other Power BI dashboards that help visualize usage and status of the platform.
Even though all Azure services have their own security capabilities, the granularity they offer may differ, and it may not reach the required level. Sidra provides a unified end-to-end security vision that allows security to be defined and applied via metadata.