Sidra Data Platform (version 2019.R3: Brash Bramley)

released on July 26, 2019

Welcome to the July 2019 release of the Sidra Data Platform. This page documents all the new features, enhancements and visible changes included in the new version 2019.R3: Brash Bramley.

July ’19 Release Overview

This release is a very significant one: it marks the completion of several lengthy pieces of work that, although they do not provide new features or capabilities, were required to improve the future supportability of the platform. The most important of these is the large migration effort to .NET Core, which brings a much simpler deployment and update process based on templates. In addition, the adoption of Identity Server means that Sidra deployments can integrate with identity providers other than Azure Active Directory, enabling new access control and data/metadata security scenarios that were not possible with the previous model.

Although this is mostly a platform modernization release, it also includes some new functionality, such as the first release of the Data Catalog API.

Before delving into the details, here are some of the key highlights included in Sidra 2019.R3:

Details of what's new in version 2019.R3

Migration to .NET Core

Sidra Data Platform has been upgraded to .NET Core. The July '19 release continues the effort to evolve Sidra as an enterprise data lake solution, and the migration to .NET Core adds several capabilities and improvements to the platform.

Deployment in Sidra Data Platform has been simplified through .NET Core templates, and the smaller packages make the whole process more lightweight and reliable. This supports agile deployment practices and also leads to fewer failures and quicker recovery times.

.NET Core comes with native support for dependency injection which makes development patterns much easier to implement. This approach allows us to decouple Sidra's modules from their concrete dependencies, improving testability and extensibility of the platform.
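As a language-neutral illustration of the decoupling described above, here is a minimal sketch of constructor injection in Python. It is not Sidra code and does not use the .NET Core container; the names `MetadataStore`, `InMemoryMetadataStore` and `CatalogService` are hypothetical.

```python
from typing import List, Protocol


class MetadataStore(Protocol):
    """Abstraction the module depends on (hypothetical name)."""

    def entity_names(self) -> List[str]: ...


class InMemoryMetadataStore:
    """Test double standing in for a real metadata database."""

    def __init__(self, names: List[str]) -> None:
        self._names = names

    def entity_names(self) -> List[str]:
        return list(self._names)


class CatalogService:
    """Receives its dependency through the constructor instead of creating it."""

    def __init__(self, store: MetadataStore) -> None:
        self._store = store

    def count_entities(self) -> int:
        return len(self._store.entity_names())


# The caller (in .NET Core, the DI container) decides which implementation to supply.
service = CatalogService(InMemoryMetadataStore(["Customer", "Order"]))
entity_count = service.count_entities()
print(entity_count)
```

Because `CatalogService` only knows the abstraction, a test can pass an in-memory store while a deployment passes a real one, which is the testability and extensibility benefit mentioned above.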

The overall performance of the platform improves on that of older frameworks, alongside better modularity and testability. .NET Core has been designed from the ground up to be cross-platform and fast. With this change, Sidra Data Platform can now run on Linux environments and in Docker/Kubernetes images.

Integration with Identity Server

Security is one of the most important characteristics of the platform. In Sidra Data Platform, all layers (from front end to back end, as well as shared services) have to protect resources and implement authentication and authorization. As the platform grows, delegating these fundamental security functions to a centralized service avoids duplicating them in each application and standardizes how user identities are managed.

This release adopts Identity Server, an OpenID Connect server certified by the OpenID Foundation, as the identity provider. We can now support scenarios such as giving third-party companies access to parts of Sidra using their own identity providers (from another AAD to a Gmail account) without having to maintain these users in our Active Directory. This also brings us closer to the mid-term goal of being able to create Apps programmatically: stay tuned for more news about this!

Data Catalog API

Sidra Data Platform provides a metadata management service to easily explore, manage and control all the data stored in the platform. This release adds a new REST API that offers third parties a simple, direct integration for data discovery. Future releases will add a UI for the Data Catalog.

Upgrade to Databricks 5.4

Sidra Databricks clusters have been updated to Databricks Runtime 5.4 so that they can leverage all the new Delta features. For more information about Databricks Runtime 5.4, check the official Databricks documentation.

Improved Logging Framework

As part of the platform modernization, we have migrated to Serilog as the logging framework. It supports:

  • multiple sinks for writing log events to storage in various formats
  • flexible logging using message templates
  • structured event data, which makes log searches and analysis possible without log parsing or regular expressions
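
To illustrate the idea behind message templates and structured event data, here is a minimal Python sketch. It is not Serilog's API: the `log_event` helper and the property names are hypothetical, and it only mimics the concept of keeping the named values next to the rendered message.

```python
import json


def log_event(template: str, **properties) -> dict:
    """Build a structured event: the template, the rendered text and the raw values."""
    return {
        "template": template,
        "message": template.format(**properties),
        "properties": properties,
    }


events = [
    log_event("Loaded {Rows} rows for {Entity}", Rows=120, Entity="Customer"),
    log_event("Loaded {Rows} rows for {Entity}", Rows=0, Entity="Order"),
]

# Because the values are stored as data, a query is a simple filter,
# with no parsing of the message text and no regular expressions.
empty_loads = [e["properties"]["Entity"] for e in events if e["properties"]["Rows"] == 0]

print(json.dumps(events[0]))
print(empty_loads)
```

A sink can then write the same event as plain text, JSON or database columns without losing the individual values.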

WebAPI Improvements

With the update to .NET Core, Sidra aims to become an even faster, more robust and more flexible data platform based on cutting-edge technologies. A series of quality improvements have been made to the WebAPI, and today we are releasing new capabilities to advance this vision:

Split API and Host in different projects
The Web API project has been split in two parts: the functionality stays in the NuGet package, and the hosting responsibility moves to a separate project, which is created by the Core template. This allows greater customization in each deployment, such as configuring the access control provider so that the user can choose either AAD or Identity Server as the identity provider, or setting the logging level and the sinks from a wide catalogue: AppInsights, database, log files, etc.

Usage of dependency.props for package versioning
From now on, all projects get the versions of the NuGet packages they reference from a common file. This way, all projects use the same version of any given package. Sidra packages are also included and will always be updated to the latest version.

API versioning
The API is now versioned. This release deploys version 1.0, and other versions can be added in the future if more than one needs to be maintained at the same time.
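
As a rough illustration of how several API versions can coexist, here is a minimal Python sketch that registers handlers per version and dispatches on the request path. The `/api/v<version>/<resource>` path shape, the `route` and `dispatch` helpers and the `entities` resource are assumptions made for the example; the release notes do not state how Sidra exposes the version.

```python
from typing import Callable, Dict, Tuple

Handler = Callable[[], dict]
_routes: Dict[Tuple[str, str], Handler] = {}


def route(version: str, resource: str):
    """Register a handler for one (version, resource) pair."""
    def register(handler: Handler) -> Handler:
        _routes[(version, resource)] = handler
        return handler
    return register


@route("1.0", "entities")
def entities_v1() -> dict:
    return {"entities": ["Customer"]}


def dispatch(path: str) -> dict:
    """Resolve a path of the assumed form /api/v<version>/<resource>."""
    parts = path.split("/")
    if len(parts) != 4 or parts[1] != "api" or not parts[2].startswith("v"):
        raise ValueError(f"Unsupported path: {path}")
    handler = _routes.get((parts[2][1:], parts[3]))
    if handler is None:
        raise LookupError(f"No handler for {path}")
    return handler()


print(dispatch("/api/v1.0/entities"))
```

A later version would be added by registering a second handler, for example under "2.0", while the 1.0 handler keeps serving existing clients.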

Support user secrets for local configuration
Developers can now use user secrets to override appsettings values from a folder outside the repository, without the risk of leaking secrets (such as connection strings and passwords) into the repository.
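
The layering idea can be sketched as follows: values committed in the repository act as defaults, and values kept outside the repository override them at load time. This is a minimal Python sketch of that merge, not the .NET configuration API; the `merge_settings` helper, the keys and the connection string are hypothetical.

```python
def merge_settings(base: dict, overrides: dict) -> dict:
    """Return a new dict where values in `overrides` win over `base`, recursively."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = value
    return merged


# Committed to the repository: no real secret, only a placeholder.
appsettings = {
    "ConnectionStrings": {"Core": "<set via user secrets>"},
    "Logging": {"Level": "Information"},
}

# Kept in the developer's profile folder, outside the repository.
user_secrets = {"ConnectionStrings": {"Core": "Server=localhost;Database=Dev"}}

effective = merge_settings(appsettings, user_secrets)
print(effective)
```

Only the overridden key changes; every other setting keeps the value stored in the repository.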

Configuration and Log are only needed in the host application
Configuration and logging are set up only in the host application and injected into other classes when needed. Logging uses Serilog by default. The log configuration, and even the log provider if something other than Serilog must be used, can be changed in the host application, which is created by the templates.

Support for different file formats in the data lake

With the release of Databricks Delta, Sidra Data Platform supports the Delta format in order to leverage features such as:

  • ACID transactions with serializability, the strongest isolation level
  • Time Travel (data versioning), which gives access to earlier versions of the data for audit or testing scenarios
  • Unified batch and streaming source and sink
  • Full DML support (UPDATE, DELETE and MERGE INTO), providing more control over the data

Apart from the Delta format, data can be stored in ORC, a highly efficient way to store Hive data (for updated details check official documentation), or Parquet, a compressed, efficient columnar data representation (for updated details check official documentation). The file format in the data lake can be configured at Entity level, so each Entity can have its own format according to its requirements.

Ability to consolidate data in data lake

Sidra now allows two modes of storing data in the data lake: snapshot mode and consolidation mode.

  • Snapshot mode appends new files to the partition system, giving the user access to the whole history.
  • Consolidation mode consolidates the data in the data lake based on a previously configured Primary Key.
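
The difference between the two modes can be sketched with plain Python lists. This is only an illustration, not Sidra's implementation: the function names and sample rows are hypothetical, and it assumes that in consolidation mode the most recently loaded row wins for a given key, which the release notes do not specify.

```python
from typing import Dict, List

Row = Dict[str, object]


def load_snapshot(table: List[Row], batch: List[Row]) -> List[Row]:
    """Snapshot mode: every load is appended, so the full history is kept."""
    return table + batch


def load_consolidated(table: List[Row], batch: List[Row], primary_key: str) -> List[Row]:
    """Consolidation mode: one row per key, newer rows replace older ones (assumed)."""
    by_key = {row[primary_key]: row for row in table}
    for row in batch:
        by_key[row[primary_key]] = row
    return list(by_key.values())


first_load = [{"id": 1, "city": "Leeds"}, {"id": 2, "city": "York"}]
second_load = [{"id": 2, "city": "Bath"}, {"id": 3, "city": "Hull"}]

snapshot = load_snapshot(load_snapshot([], first_load), second_load)
consolidated = load_consolidated(
    load_consolidated([], first_load, "id"), second_load, "id"
)

print(len(snapshot))       # both versions of id 2 are kept
print(len(consolidated))   # one row per id
```

With these sample loads, the snapshot keeps both versions of the row with id 2, while the consolidated table holds a single, updated row for it.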

Online marketing connectors

Sidra provides new connectors to extract data from widely used marketing platforms. The connectors currently available are Google Analytics and Bing Ads, and more connectors, such as one for the LinkedIn API, are expected in the near future.

SQL Server metadata inference

Sidra Core works with metadata in order to summarize basic information about data, making it easier to find and work with. Until now, most of the metadata configuration was performed manually when adding new providers. The API can now infer the Entity metadata from a T-SQL query or an entire SQL Server database, automating this process and creating all the required metadata as it is defined in SQL Server. Although nothing is executed in the source database, access is required to get the schema of the query (or the schema of the entire database).

Note: As of today only SQL Server is supported, but support for more database engines is expected to be added in the future.
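
To give a feel for schema-based inference, here is a minimal Python sketch that reads column definitions from a database catalog and turns them into entity metadata without reading any rows. It uses SQLite from the Python standard library purely as a stand-in, since Sidra targets SQL Server; the `infer_entities` helper, the output shape and the `Customer` table are hypothetical and do not reflect the Sidra API.

```python
import sqlite3


def infer_entities(connection: sqlite3.Connection) -> dict:
    """Describe every table from the catalog only; no table data is read."""
    entities = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info returns: cid, name, type, notnull, dflt_value, pk
        columns = connection.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        entities[table_name] = [
            {
                "name": column[1],
                "type": column[2],
                "nullable": not column[3],
                "primary_key": bool(column[5]),
            }
            for column in columns
        ]
    return entities


# Stand-in source database with a single sample table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL)")

metadata = infer_entities(connection)
print(metadata)
```

The same principle applies to the feature described above: only schema information is requested from the source, which is why access to the database is still required.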

Issues fixed in 2019.R3


  • Fixed an error that made the extract from data lake process fail when there was no actual data to be extracted. #66922
  • Fixed an issue that would overwrite a Data Lake table under certain circumstances. #68576
  • Aligned Query API responses with the REST standard. #73409
  • Made the logging system more robust when logging to AppInsights. #73712

Coming soon: Real-time support with Databricks Delta

What follows is a brief overview of what is next for Sidra Data Platform.

Most importantly, we are working on a consolidation of the data synchronization security which, in conjunction with the Identity Server migration, will greatly enhance the security experience of Sidra. We are also working on an overhaul of the existing real-time streaming data ingest, based on the new Databricks Delta format. In addition, we will improve the newly released Data Catalog API with Identity Server, provide new functionality to automatically compute and store differences between load processes, and deliver other minor quality-of-life items, such as default pipelines for SQL Database automatic maintenance.

Feedback

We would love to hear from you! For issues, contact us at info@sidra.dev. You can make a product suggestion, report an issue, ask questions, find answers, and propose new features.