Sidra Data Platform (version 2019.R2: Alluring Admiral)¶
released on March 12, 2019
Welcome to the March 2019 release of the Sidra Data Platform. This page documents all of the new features, enhancements and visible changes included in the new version 2019.R2: Alluring Admiral.
Details of what's new in version 2019.R2¶
Sidra Data Platform now supports Azure Databricks as an analytics service. Azure Databricks provides an end-to-end, managed Apache Spark platform optimized for the cloud. With easy deployment, auto-scaling, flexibility and an optimized runtime, Databricks offers a simple and cost-efficient environment for running large-scale Spark workloads.
Introducing Azure Databricks as part of Sidra Data Platform enables seamless collaboration between all the parties that need to interact with the data. Azure Databricks brings live, shared notebooks with real-time collaboration, so that everyone in the organization can work with the data: connect to common data sources, run machine learning experiments, and perform data transformations and data quality checks. Cluster uptime can be scheduled, and the entire deployment is included in ARM packages.
Azure Databricks is closely integrated with the Azure platform features that are also included in the Sidra Data Platform, and supports:
- Azure Storage and Azure Data Lake integration
- Power BI for interactive visualization, connected directly to the Databricks cluster to query data at massive scale
- Azure Active Directory to manage the access controls to resources
- Azure SQL Data Warehouse, Azure SQL DB, and Azure Cosmos DB
- Flexible network topology, including deployments in customer VNETs
The Sidra Data Platform now includes a REST API service layer for interacting with the data managed by the platform. The API is documented using Swagger. The service layer is secured: only entities configured in Azure Active Directory can make requests. The API exposes the following information:
- Cluster status, for both Databricks and HDInsight: the name of the cluster, its status (up, down or creating), upTime and number of nodes.
- Service availability: the name and availability of each service.
- Data lake locations: the name, location and storage size of each data lake.
- Data loads: the volume loaded, validation errors and date of each data load.
- General statistics such as:
    - number of registered applications
    - stored volume
    - total number of entities
    - total number of assets loaded
    - total number of rows
    - number of data lake regions
    - total number of providers
    - average daily load time in Core
    - average daily load size in Core
- Latest warnings and errors: given an id, the service returns the latest warnings and errors that occurred.
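As a rough illustration of how a client might call the protected service layer, the sketch below builds a GET request that carries an Azure Active Directory bearer token. The base URL, the `/api/clusters` path and the token value are hypothetical placeholders, not documented Sidra endpoints; the Swagger documentation remains the authority for the actual routes.

```python
from urllib.parse import urljoin
from urllib.request import Request


def build_api_request(base_url: str, path: str, access_token: str) -> Request:
    """Build a GET request that carries an Azure AD bearer token."""
    # Normalize slashes so base URL and path join predictably.
    url = urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))
    request = Request(url, method="GET")
    request.add_header("Authorization", f"Bearer {access_token}")
    request.add_header("Accept", "application/json")
    return request


# Hypothetical values, for illustration only. The request is built but not sent.
request = build_api_request(
    "https://sidra.example.com", "/api/clusters", "<access-token>"
)
```

Sending the request (for example with `urllib.request.urlopen`) would additionally require a valid token issued for an entity registered in Azure Active Directory.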
DW data extraction updated¶
Data Warehouse data extraction was updated as follows:
- The Core load was split from the App loads.
- The database server and database were added to the ETL logic.
- Support was added for loading from different schema names.
DW database support in DatabaseBuilder¶
To deploy the Core.DW database automatically, we created a code-first context. The DatabaseBuilder was updated to support this context together with its scripts, which are organized in one folder per context.
Methods to wait for the result of a Databricks job¶
We created sync and async methods to internally wait for the result of the Databricks run.
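The general idea of waiting on a run can be sketched as a polling loop in both a blocking and an asynchronous form. This is a minimal sketch, not the platform's implementation: `wait_for_run`, `wait_for_run_async` and the injected `get_state` callable are hypothetical names. The state values follow the `life_cycle_state` and `result_state` fields that the Databricks Jobs API reports for a run.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict

RunState = Dict[str, str]

# life_cycle_state values after which a Databricks run no longer changes.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def wait_for_run(get_state: Callable[[], RunState],
                 poll_seconds: float = 5.0,
                 timeout_seconds: float = 3600.0) -> RunState:
    """Block until the run reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state()
        if state.get("life_cycle_state") in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("Databricks run did not finish in time")
        time.sleep(poll_seconds)


async def wait_for_run_async(get_state: Callable[[], Awaitable[RunState]],
                             poll_seconds: float = 5.0,
                             timeout_seconds: float = 3600.0) -> RunState:
    """Same loop, but yields to the event loop between polls."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = await get_state()
        if state.get("life_cycle_state") in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("Databricks run did not finish in time")
        await asyncio.sleep(poll_seconds)


# Usage with fake state sources standing in for real API calls.
fake_states = iter([
    {"life_cycle_state": "PENDING"},
    {"life_cycle_state": "RUNNING"},
    {"life_cycle_state": "TERMINATED", "result_state": "SUCCESS"},
])
sync_result = wait_for_run(lambda: next(fake_states), poll_seconds=0)


async def fake_async_state() -> RunState:
    return {"life_cycle_state": "INTERNAL_ERROR"}


async_result = asyncio.run(wait_for_run_async(fake_async_state, poll_seconds=0))
```

Injecting `get_state` keeps the waiting logic independent of how the run status is fetched, which also makes it easy to exercise without a live cluster.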
Changing the authentication token in AzureDataBricksHelper¶
It is now possible to change the authentication token in AzureDataBricksHelper. Both GenericAPIClient and AzureDataBricksHelper include a method for configuring it. To maximize security, the token is stored in a key vault, so the ClusterService was updated to retrieve the token from there.
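The pattern described here, a client whose token can be replaced with a value read from a secret store, might look roughly like the sketch below. `ApiClient`, `set_token`, `refresh_token` and the dictionary standing in for a key vault are all hypothetical and only illustrate the idea; they are not the signatures of GenericAPIClient, AzureDataBricksHelper or ClusterService.

```python
from typing import Callable, Dict


class ApiClient:
    """Hypothetical stand-in for an API client with a replaceable token."""

    def __init__(self, token: str) -> None:
        self._token = token

    def set_token(self, token: str) -> None:
        # Reject empty tokens early instead of failing on the next request.
        if not token:
            raise ValueError("token must not be empty")
        self._token = token

    def auth_headers(self) -> Dict[str, str]:
        # Databricks REST calls accept a token as a bearer credential.
        return {"Authorization": f"Bearer {self._token}"}


def refresh_token(client: ApiClient,
                  read_secret: Callable[[str], str],
                  secret_name: str) -> None:
    """Read the current token from a secret store and apply it to the client."""
    client.set_token(read_secret(secret_name))


# Usage: a plain dictionary plays the role of the key vault (assumption).
fake_vault = {"databricks-token": "new-token"}
client = ApiClient("old-token")
refresh_token(client, fake_vault.__getitem__, "databricks-token")
```

Keeping the token out of code and configuration files means that rotating it only requires updating the secret store and refreshing the client.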
New selection methods in EntityRepository¶
We added methods in EntityRepository to get entities based on:
- TableName and the DatabaseName of the provider related to the entity
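To make the lookup concrete, here is a small in-memory sketch of selecting entities by table name and by the database name of their provider. The `Provider` and `Entity` classes, their fields, the exact-match comparison and the sample data are assumptions made for illustration; the actual EntityRepository methods and data model may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Provider:
    database_name: str


@dataclass(frozen=True)
class Entity:
    name: str
    table_name: str
    provider: Provider


def find_entities(entities: Iterable[Entity],
                  table_name: str,
                  database_name: str) -> List[Entity]:
    """Return entities whose table and provider database both match exactly."""
    return [
        entity for entity in entities
        if entity.table_name == table_name
        and entity.provider.database_name == database_name
    ]


# Illustrative data only.
sales = Provider(database_name="SalesDb")
hr = Provider(database_name="HrDb")
catalog = [
    Entity("Customer", "customer", sales),
    Entity("Employee", "employee", hr),
    Entity("CustomerArchive", "customer", hr),
]
matches = find_entities(catalog, "customer", "SalesDb")
```

Filtering on both values matters because the same table name can exist under more than one provider, as the sample data shows.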
Naming convention refactor¶
The metadata objects used in the Sidra Data Platform have been renamed:
- Asset (formerly File)
- Entity (formerly FileType)
- Attribute (formerly FileColumn)
- AttributeFormat (formerly FileColumnFormat)
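When updating notes or scripts that still use the old terms, the renames above can be applied mechanically. The sketch below is an illustrative helper, not a tool shipped with the platform; `rename_metadata_terms` is a hypothetical name. It replaces whole words only and tries longer names first, so `FileColumnFormat` is never partially rewritten as `FileColumn` or `File`.

```python
import re

# Old metadata names mapped to the names introduced in this release.
RENAMES = {
    "FileColumnFormat": "AttributeFormat",
    "FileColumn": "Attribute",
    "FileType": "Entity",
    "File": "Asset",
}

# Longest names first, matched only as whole words.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(RENAMES, key=len, reverse=True)) + r")\b"
)


def rename_metadata_terms(text: str) -> str:
    """Replace old metadata names with their new equivalents."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], text)


renamed = rename_metadata_terms("File FileType FileColumn FileColumnFormat FileId")
```

Because matching is limited to whole words, compound identifiers such as `FileId` are left untouched and would need to be reviewed separately.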
Sidra Data Platform now contains a testing deployment environment hosted in Azure. The entire environment is deployed using ARM packages.
Issues fixed in 2019.R2¶
- Fixed an issue with Spark and metastore incompatibility in some scenarios.
- Fixed an issue when extracting an activity which couldn't resolve the IAzureDataBricksHelper.
- Fixed an issue where GetDataLakes did not return DataLakes with IdLocation or IdClusterType NULL.
- Fixed an issue with inconsistent types in Database.Client and Persistence.Client: the FileId type was int in the Database but long in the Persistence.Client.
- Fixed an issue where sync packages couldn't update.
- Fixed an issue where FileSize was always set to -1 during the registration of a new file; the API was updated to get FileSize properly.
- Fixed invalid error count in transfer query.
- Fixed the transfer queries generated by the GenerateTransferQuery custom activity, which threw an exception if the file had 0 bytes.
- Fixed AzureDataBricksHelper to use GenericAPIClient.
- Fixed the RetryImport Web Job failing due to a cast exception.
- Modified the order of the Unity parameters in DataflowCustomActivity.
- DatabaseBuilder now successfully executes the scripts in consumer applications during the deployment process.
- Fixed an issue where, when installing the Deployment.Azure package, the scripts were not copied correctly if the target project was not an AzureResourceGroup.
- The custom activity now successfully generates the table creation script based on the metadata.
- The Power BI tenantId parameter can now be customized in the Power BI helper in order to compose the authorityUri string.
We would love to hear from you! For issues, contact us at email@example.com. You can make a product suggestion, report an issue, ask questions, find answers, and propose new features.