What is Sidra Data Platform
Sidra Data Platform, also known as Sidra, is a data lake platform built on Azure PaaS technologies. It provides a solution for enterprise data lake scenarios that reduces time to market and offers scalability and ease of maintenance.
To shorten the time needed to analyze, understand and exploit data, a unified solution is required that automates data ingestion, cataloging, governance and management. To this end, Sidra Data Platform provides:
- an integrated platform for everyone who needs to interact with the data
- ability to automate the generation of the whole data pipeline based on data source metadata
- ability to manage all enterprise data sources, giving support for multiple data lake regions where local laws may apply
- a significant reduction of the cost and time spent on custom implementation by providing accelerators to manage ingestion and internal data movement, as well as shared services
- the ability to store data of any size, shape, and speed, and to perform all types of processing and analytics across platforms and languages
How does it work
Sidra orchestration is based on data pipelines: sets of data processing elements connected in series. This approach eliminates many manual steps from the process and enables a smooth, automated end-to-end flow of data. It begins by defining which data is collected, from where, and how. The goal is to collect data points from many different sources and process the results in near real time.
The pipelines are made up of several stages, where a stage is a set of actions. Structured or unstructured data arrives at the landing zone from different sources; it is then registered, stored in raw storage, and validated. As the final step of the Sidra Core data pipeline, the data is stored in an optimized storage, where anyone who needs to interact with the data can access it.
Why is an automated pipeline needed? Mainly to have an integrated platform where everyone who needs the data can access it in the same format. The environment in which the data pipeline is deployed is reproducible, which means it can be replicated by almost anyone in an automated way, avoiding human error. Security and backup systems are key features of Sidra Data Platform, but its most significant characteristic is that it can be debugged.
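The sequence of stages described above can be pictured with a minimal Python sketch. It is only an illustration of "processing elements connected in series": the `Asset` class, the stage functions and `run_pipeline` are hypothetical names chosen for this example and are not part of Sidra's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Asset:
    """Hypothetical unit of data travelling through the pipeline."""
    name: str
    rows: List[dict]
    history: List[str] = field(default_factory=list)


def land(asset: Asset) -> Asset:
    # Data arrives at the landing zone.
    asset.history.append("landed")
    return asset


def register(asset: Asset) -> Asset:
    # The arrival is recorded so it can be tracked later.
    asset.history.append("registered")
    return asset


def store_raw(asset: Asset) -> Asset:
    # An untouched copy is kept in raw storage.
    asset.history.append("stored_raw")
    return asset


def validate(asset: Asset) -> Asset:
    # Illustrative rule only: an asset with no rows is rejected.
    if not asset.rows:
        raise ValueError(f"{asset.name}: no rows to validate")
    asset.history.append("validated")
    return asset


def store_optimized(asset: Asset) -> Asset:
    # Final stage: data is written to the optimized storage.
    asset.history.append("stored_optimized")
    return asset


# Stages in the order given in the text above.
STAGES: Sequence[Callable[[Asset], Asset]] = (
    land, register, store_raw, validate, store_optimized,
)


def run_pipeline(asset: Asset, stages: Sequence[Callable[[Asset], Asset]] = STAGES) -> Asset:
    """Run every stage in series; a failing stage stops the flow."""
    for stage in stages:
        asset = stage(asset)
    return asset


if __name__ == "__main__":
    result = run_pipeline(Asset(name="sales.csv", rows=[{"id": 1}]))
    print(result.history)
```

Because each stage has the same signature, stages can be reordered or extended without touching `run_pipeline`, which is the property that makes an end-to-end flow easy to automate.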
Productivity keys of Sidra Data Platform
Some of the Sidra Data Platform productivity keys are highlighted below:
- helps reduce the time to market, providing scalability and ease of maintenance
- centralizes the data for the whole organization
- automates metadata capture and catalog
- provides data visibility and availability for everyone who needs to interact with the data
- ensures the agility and flexibility that organizations need to support advanced analytics and business insights
Sidra can be divided into two main components: Core and client applications. Core encompasses a set of shared services such as automated ingestion, a management UI, audit, lineage and a catalog. These services are used by the client applications to retrieve and transform the data as needed.
Core shared services
Core is one of the main components of Sidra Data Platform. It orchestrates the entire data pipeline. Sidra Core shared services offer the following capabilities:
Automated generation of data pipelines for rapid and scalable data ingestion
The number of data pipelines should be able to grow according to the needs of the platform. That is, the number of data sources to be ingested should make no difference from a time-to-market point of view. For instance, ingesting 10 or 1000 tables from SQL Server requires the same deployment effort when using Sidra's template-based generation system.
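The idea of template-based generation can be sketched in a few lines of Python. This is not Sidra's generation system: the template text, its fields and the `generate_pipelines` function are assumptions made purely to show why the effort does not depend on the number of tables.

```python
from string import Template
from typing import Iterable, List

# Hypothetical pipeline definition template; the field names are illustrative.
PIPELINE_TEMPLATE = Template(
    "pipeline: ingest_${source}_${table}\n"
    "  source: ${source}\n"
    "  table: ${table}\n"
    "  sink: raw/${source}/${table}\n"
)


def generate_pipelines(source: str, tables: Iterable[str]) -> List[str]:
    """Render one pipeline definition per table from the source metadata."""
    return [PIPELINE_TEMPLATE.substitute(source=source, table=t) for t in tables]


if __name__ == "__main__":
    # The same call handles 10 or 1000 tables; only the metadata list changes.
    small = generate_pipelines("sqlserver", [f"table_{i}" for i in range(10)])
    large = generate_pipelines("sqlserver", [f"table_{i}" for i in range(1000)])
    print(len(small), len(large))
    print(small[0])
```

The template is written once; everything that varies comes from the metadata, so adding sources means adding metadata rather than writing new pipeline code.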
Comprehensive audit of all the system operations
In Sidra Data Platform we perform advanced lineage tracking of all entities and transformations. This provides an independent examination of the software product and its processes.
Time is always an important factor, and systems are built to be fast and accurate. When you need to interact with and query the data, feedback should therefore be immediate. Sidra Data Platform offers this capability, which is useful for exploratory purposes and for building data products that need to update in near real time.
Web-based management UI
In Sidra Data Platform we are building a modern, web-based management UI that aims to meet the needs of both developers and administrators. It provides a visual widget to ingest new data, a dashboard to track the operational status, and centralized management of the different logs.
A Data Catalog that provides a view of all entities loaded across the different storage regions using a set of services focused on the management and discoverability of the data.
Monitoring is a key factor in reducing costs related to network performance, productivity and infrastructure. Operational activities in Sidra Data Platform, such as data loads, can be monitored using Power BI dashboards.
Anomaly detection models for the data movement activities
A smoothly running data workflow needs a robust and stable infrastructure. Anomaly detection models for the data movement activities are an important tool in Sidra: they help to identify unusual events that can have an impact on the process, so that outliers are detected before the data is processed.
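The source does not describe which models Sidra uses. As a stand-in, the sketch below applies a simple z-score check to the row counts of previous loads; the function name, the threshold and the sample figures are all assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], new_value: float, threshold: float = 3.0) -> bool:
    """Flag new_value if it lies more than `threshold` standard deviations
    from the mean of previous loads (plain z-score, not Sidra's model)."""
    if len(history) < 2:
        # Not enough history to estimate spread; do not flag.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All previous loads were identical: any deviation is unusual.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical row counts from earlier loads of the same table.
    previous_loads = [1000, 1020, 980, 1010, 990]
    print(is_anomalous(previous_loads, 1005))
    print(is_anomalous(previous_loads, 10))
```

Running a check like this before the processing stages means a suspicious load can be held back instead of propagating through the rest of the pipeline.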
Any actor that needs to access the data stored in the data lake for a specific business need is categorized as a client application.
Each Client Application makes use of its own set of tools. It can retrieve data from the data lake and apply business transformations if needed.
Because security is a concern in Sidra, the access level of each Client Application can be controlled. For example, if an application needs to set up a sandbox for ML experiments carried out by a third party, the Client Application's access can be restricted to non-sensitive data.
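The sandbox example can be made concrete with a small sketch. How Sidra actually models access levels is not described here, so the `Entity` and `ClientApp` classes and the single `sensitive` flag below are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Entity:
    """Hypothetical catalog entry with a sensitivity flag."""
    name: str
    sensitive: bool


@dataclass(frozen=True)
class ClientApp:
    """Hypothetical client application with a coarse access level."""
    name: str
    can_read_sensitive: bool


def visible_entities(app: ClientApp, catalog: Iterable[Entity]) -> List[Entity]:
    """Return only the entities this client application may read."""
    return [e for e in catalog if app.can_read_sensitive or not e.sensitive]


if __name__ == "__main__":
    catalog = [
        Entity("web_traffic", sensitive=False),
        Entity("customer_payments", sensitive=True),
    ]
    sandbox = ClientApp("ml_sandbox", can_read_sensitive=False)
    print([e.name for e in visible_entities(sandbox, catalog)])
```

Filtering at the point where the application requests data keeps the restriction independent of whatever tools the third party uses inside the sandbox.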
The Client Applications can be deployed in multiple instances and in different geographical zones.