What is Sidra Data Platform¶
Sidra Data Platform, also known as Sidra, is a data lake platform built on Azure PaaS technologies. It provides a solution for enterprise data lake scenarios that reduces time to market while providing scalability and ease of maintenance.
Built on Azure PaaS¶
Sidra is built on Azure PaaS. It is an enterprise data lake solution focused on deploying a working system quickly, facilitating scalability throughout the lifecycle of the platform and simplifying every action related to maintenance.
Sidra is an automated and customizable platform capable of processing large amounts of data regardless of its source. Among other features, it offers transparent storage of data in multiple regions, an integrated data catalogue service, data lineage control, a consolidated view of logs and audit, and a comprehensive set of associated services and extensibility APIs.
AI and Data Governance¶
Sidra provides the common foundation, shared services and data governance on which organizations build their specific use cases, from analytical applications based on SQL Server and Power BI to exploratory analysis and the generation of machine learning models using Databricks and MLflow.
- Full deployment in a matter of days
- Automation of data source configuration gets you from zero to data lake in hours
- Modular and adaptable to each scenario (real-time, ML model serving, web interface…)
- The Data Catalog and governance capabilities can help address the challenges of data protection regulations
Multimodal storage supporting all types of data sources: from databases and APIs to documents and media files
ML Model Serving Platform
Enable your Data Science teams to build, test and deploy secure models, while keeping track of both code and training data for audit and explainability purposes
Security and Identity
Identity management via Identity Server, giving users secure access to the platform through different authentication providers (Azure Active Directory, Google Accounts…)
Data Load ML Models
Pre-packaged models that tackle the most common challenges during the data load process, such as corruption or anomalies in the data set, as well as automatic detection of PII sensitive data.
Integration and Extensibility
APIs for the integration of third-party tools in areas such as Data Catalog or Data Retrieval, as well as a Python SDK for Data Scientists.
Data Load Automation
Automation of ETL/ELT processes through automatic generation of pipelines for data loading, movement and processing.
Complete Data Catalog with web UI and API access, as well as data lineage audit and traceability.
Batch and Real-time
Support for both batch and real-time data loads, enabling operational data lake scenarios.
Deploy your data lake in less than 24 hours¶
Accelerating the time needed to analyze, understand and exploit data requires a unified solution that automates data ingestion, cataloguing, governance and management. For this, Sidra Data Platform provides:
- an integrated platform for everyone who needs to interact with the data
- the ability to automate the generation of the whole data pipeline based on data source metadata
- the ability to manage all enterprise data sources, with support for multiple data lake regions where local laws may apply
- a significant reduction of the cost and time spent on custom implementation by providing accelerators to manage ingestion and internal data movement, as well as shared services
- the ability to store data of any size, shape, and speed, and to perform all types of processing and analytics across platforms and languages
How does it work¶
Sidra orchestration is based on data pipelines: sets of data processing elements connected in series. This approach eliminates many manual steps from the process and enables a smooth, automated end-to-end flow of data. It begins by defining which data is collected, where and how. The goal is to collect data points from many different sources and process the results in near real-time.
Each pipeline is made of several stages, where a stage is a set of actions. Structured or unstructured data arrives in the landing zone from different sources; it is then registered, stored in raw storage and validated. As the final step of the Sidra Core data pipeline, the data is stored in an optimized storage, where anyone who needs to interact with the data can access it.
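The stages described above can be pictured as functions applied in series. The following is only a minimal sketch to illustrate the idea, not Sidra's implementation: the `Asset` class, the stage functions and `run_pipeline` are hypothetical names introduced for this example, and real stages would move data between storage services rather than append labels to a list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Asset:
    """Hypothetical record for one piece of data that reached the landing zone."""
    name: str
    rows: List[dict]
    history: List[str] = field(default_factory=list)


def register(asset: Asset) -> Asset:
    # Record that the platform knows about this asset.
    asset.history.append("registered")
    return asset


def store_raw(asset: Asset) -> Asset:
    # Keep an untouched copy before any validation happens.
    asset.history.append("raw_stored")
    return asset


def validate(asset: Asset) -> Asset:
    # Illustrative check only: reject assets that carry no rows.
    if not asset.rows:
        raise ValueError(f"{asset.name}: no rows to load")
    asset.history.append("validated")
    return asset


def store_optimized(asset: Asset) -> Asset:
    # Final step: make the data available for consumers.
    asset.history.append("optimized_stored")
    return asset


# Stages connected in series, in the order given in the text.
STAGES: List[Callable[[Asset], Asset]] = [register, store_raw, validate, store_optimized]


def run_pipeline(asset: Asset, stages: List[Callable[[Asset], Asset]] = STAGES) -> Asset:
    # Each stage receives the output of the previous one.
    for stage in stages:
        asset = stage(asset)
    return asset
```

Calling `run_pipeline(Asset("orders.csv", [{"id": 1}]))` walks the asset through the four stages in order, while an asset with no rows stops at the validation stage with an error.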
Why is an automated pipeline needed? Mainly, to provide an integrated platform where everyone who needs the data can access it in the same format. The environment where the data pipeline is deployed is reproducible, which means it can be replicated by almost anyone in an automated way, avoiding human mistakes. Security and backup systems are key aspects of Sidra Data Platform, but its most significant characteristic is that it can be debugged.
Productivity keys of Sidra Data Platform¶
Some of the productivity keys of Sidra Data Platform:
- helps reduce the time to market, providing scalability and ease of maintenance
- centralizes the data for the whole organization
- automates metadata capture and catalog
- provides data visibility and availability for everyone who needs to interact with the data
- ensures the agility and flexibility of advanced analytics and business insights that organizations need to support
Sidra can be divided into two main components: core and client applications. Core encompasses a set of shared services like automated ingestion, management UI, audit, lineage, catalog. These services are used by the client applications to retrieve and transform the data as needed.
Core shared services¶
Core is one of the main components of Sidra Data Platform. It orchestrates the entire data pipeline. Sidra Core shared services offer the following properties:
Automated generation of data pipelines for rapid and scalable data ingestion
The number of data pipelines should be able to grow according to the platform's needs. That is, the number of data sources to be ingested makes no difference from a time-to-market point of view. For instance, ingesting 10 or 1000 tables from SQL Server requires the same deployment effort when using Sidra's template-based generation system.
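To illustrate why template-based generation keeps the effort constant, here is a small sketch under stated assumptions. It is not Sidra's generation system: the metadata shape, the `generate_pipeline` function and the `landing/...` path layout are all hypothetical. The point is only that one template plus source metadata yields a pipeline definition per table, whatever the number of tables.

```python
from typing import Dict, List


def generate_pipeline(table_meta: Dict) -> Dict:
    """Build one pipeline definition from table metadata (hypothetical shape)."""
    schema = table_meta["schema"]
    table = table_meta["table"]
    return {
        "name": f"load_{schema}_{table}",
        "source": f"{schema}.{table}",
        # Assumed landing path layout, for illustration only.
        "sink": f"landing/{schema}/{table}/",
        "columns": list(table_meta["columns"]),
    }


def generate_pipelines(tables: List[Dict]) -> List[Dict]:
    # The same template is applied to every table: 10 or 1000 makes no difference.
    return [generate_pipeline(meta) for meta in tables]


# Example metadata, as it might be read from a source database catalog.
tables = [
    {"schema": "sales", "table": f"orders_{i}", "columns": ["id", "amount"]}
    for i in range(1000)
]
pipelines = generate_pipelines(tables)
```

Adding a new table then means adding one metadata entry, not writing a new pipeline by hand.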
Comprehensive audit of all the system operations
In Sidra Data Platform we perform advanced lineage tracking of all entities and transformations. This allows an independent examination of the software product and its processes.
Time is always an important factor: everything is built to be fast and accurate. For this reason, when you interact with and query the data, the feedback should be immediate. Sidra Data Platform offers this capability, which is useful for exploratory purposes and for building data products that need to update in near real-time.
Web-based management UI
In Sidra Data Platform we are building a modern, web-based management UI aiming to fulfill the needs of both developers and administrators. It provides a visual user widget to ingest new data, a dashboard to track the operational status and a central management of the different logs.
A Data Catalog that provides a view of all entities loaded across the different storage regions using a set of services focused on the management and discoverability of the data.
Monitoring is a key factor in saving money on network performance, productivity and infrastructure. Operational activities in Sidra Data Platform, such as data loads, can be monitored using Power BI dashboards.
Anomaly detection models for the data movement activities
A smooth-running data workflow needs a robust and stable infrastructure. Anomaly detection models for the data movement activities are an important tool in Sidra: they help identify unusual events that can affect the process, so that outliers are detected before the data is processed.
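As a rough illustration of what detecting an outlier in a data movement activity can mean, the sketch below flags a load whose row count is far from the recent average. This is a simple z-score check used as a stand-in, not the models shipped with Sidra; the `is_anomalous` function, the row-count metric and the threshold of 3 standard deviations are assumptions made for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], new_value: float, threshold: float = 3.0) -> bool:
    """Return True when new_value deviates strongly from past values.

    history: row counts (or any metric) of previous loads of the same entity.
    threshold: how many sample standard deviations count as unusual (assumed value).
    """
    if len(history) < 2:
        # Not enough past loads to estimate a spread.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past loads were identical: any different value is unusual.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold


# Row counts of the last five loads of a hypothetical entity.
previous_loads = [1000, 1020, 980, 1010, 990]
```

With this history, a new load of 1005 rows is within the usual range, while a load of 10 rows is flagged and could be held back before processing.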
Any actor that needs to access the data stored in the data lake for a specific business need is catalogued as a client application.
Each Client Application uses its own set of tools. It can retrieve data from the data lake and apply business transformations if needed.
Because security is a concern in Sidra, the access level of each Client Application can be controlled. For example, if an application needs to set up a sandbox for ML experiments carried out by a third party, that Client Application's access can be restricted to non-sensitive data.
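The sandbox scenario can be sketched as a filter over catalogued entities. This is an illustrative model only and does not reflect how Sidra stores or enforces permissions: the `Entity` and `ClientApp` classes, the `sensitive` flag and the `visible_entities` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entity:
    """Hypothetical catalog entry with a sensitivity flag."""
    name: str
    sensitive: bool


@dataclass(frozen=True)
class ClientApp:
    """Hypothetical client application with a single access-level switch."""
    name: str
    can_read_sensitive: bool


def visible_entities(app: ClientApp, entities: List[Entity]) -> List[Entity]:
    # An app without sensitive access only sees non-sensitive entities.
    return [e for e in entities if app.can_read_sensitive or not e.sensitive]


catalog = [
    Entity("web_clicks", sensitive=False),
    Entity("customer_emails", sensitive=True),
]
sandbox = ClientApp("ml_sandbox", can_read_sensitive=False)
reporting = ClientApp("finance_reporting", can_read_sensitive=True)
```

Here the third-party sandbox would only be offered `web_clicks`, while the internal reporting application would see both entities.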
The Client Applications can be deployed in multiple instances and in different geographical zones.