
Data Labs Data Product

To enable exploratory data analysis, as well as experimentation with models prior to their productionalization, Sidra incorporates a Data Product template called Data Labs, whose main component is a dedicated Databricks cluster.

This document refers to key concepts of a Data Product in Sidra, which can be reviewed here.

Purpose

The main purpose of the Data Labs Data Product is to provide a sandbox environment for data analytics and exploratory data analysis use cases. The Data Labs deployment includes a dedicated Databricks cluster, separate from the data intake cluster in Core. This ensures separation of concerns between the data intake cluster and the Data Product, so that data intake performance is not affected. Compliance requirements can also be fulfilled thanks to the ability to define different locations for the clusters. In addition, this separation allows different cluster specs to be defined: while the DSU data intake cluster is an engineering type of cluster, the Data Labs cluster can be tailored to the needs of the business case (e.g. GPU-enabled).

The Databricks cluster in Data Labs can be used for a myriad of scenarios through the usage of Databricks notebooks; a minimal notebook sketch is shown after the list below. These scenarios include:

  • Exploratory data analysis, including the visualization capabilities of Databricks notebooks.
  • Machine Learning model experiments and training.
  • Advanced analytics and multiple processing paths, through the orchestration and chaining of logic across several notebooks.
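
For illustration, a minimal exploratory notebook could look like the sketch below. It assumes the raw copy of an Entity is available as Parquet files in the Data Labs storage account; the storage account, container and folder names are hypothetical and depend on the actual deployment.

# Minimal exploratory-analysis sketch for a Data Labs Databricks notebook.
# The storage account, container and folder layout are assumptions; adapt them
# to the raw copy produced by the Data Labs pipelines.
raw_path = "abfss://raw@mydatalabsstorage.dfs.core.windows.net/myprovider/myentity/"

# Read the copied raw data (Parquet in this sketch) into a Spark DataFrame.
df = spark.read.parquet(raw_path)

# Basic profiling: schema, row count and summary statistics.
df.printSchema()
print(f"Row count: {df.count()}")
display(df.describe())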

As long as Data Labs is configured with the required permissions to access the DSU data, this Data Product transparently and automatically retrieves data from the DSU.

Data Labs, like any Data Product integrated with Sidra, shares the common security model with Sidra Core and uses Identity Server for authentication. A copy of the relevant ingested Assets metadata is always kept synchronized with Sidra Core. The metadata synchronization is performed by an automated Sync job, explained here.

The actual data flow orchestration is performed by a dedicated instance of Azure Data Factory installed in the Data Product. Thanks to the orchestration pipelines, the content of the Delta tables in the DSU Databricks is copied directly to the Data Product.

High-level installation details

As with any other type of Data Product in Sidra, the process of installing this Data Product consists of the following main steps:

  • A dotnet template is installed, which launches a build and release pipeline in Azure DevOps defined for the Data Labs Data Product.
  • As part of the build and release pipeline for this Data Product, the needed infrastructure is deployed. This includes the execution of the ClientDeploy.ps1 and Databricks deployment scripts, as well as the deployment of the different WebJobs.

Build and release are performed with multi-stage pipelines, so by default no manual intervention is required once the template is installed.

For more information on these topics you can access the following Documentation.

Architecture

The Data Labs Data Product resources are contained in a single resource group, separate from the Sidra Core and DSU resource groups. The ARM template for this Data Product includes the following pieces:

  • Storage account: used for storing the copy of the raw data that is extracted from the DSU and to which the Data Product has access.
  • Databricks cluster: used for computing and processing the business logic required on the data copied from the DSU. One or several orchestrator notebooks can be created in order to process the data from the storage.
  • Data Factory: used for data orchestration pipelines to bring the data from the DSU to Data Labs.
  • Client Database: used for keeping a synchronized copy of the Assets metadata between Sidra Core and Data Labs.
  • Key Vault: used for storing and accessing secrets in a secure way (a sketch of reading a secret from a notebook is shown below).

Data Labs ARM template
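
As an illustration of how the Key Vault can be used from the Data Labs cluster, a notebook can read secrets through a Databricks secret scope backed by the Data Product Key Vault. The following is a minimal sketch; the scope, secret and storage account names are hypothetical.

# Hypothetical secret scope and secret names; the actual names depend on the
# Data Labs deployment configuration.
storage_key = dbutils.secrets.get(scope="datalabs-keyvault", key="storage-account-key")

# One possible use: configure access to the Data Labs storage account from the
# notebook session (the deployment may already configure this access).
spark.conf.set(
    "fs.azure.account.key.mydatalabsstorage.dfs.core.windows.net",
    storage_key,
)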

Besides the Azure infrastructure, several WebJobs are also deployed for Data Labs; they are responsible for the background tasks of data and metadata synchronization:

  • Sync
  • DatabaseBuilder
  • DatafactoryManager

Data Product pipelines

The pipeline template for a Notebook execution, with ItemId F5170307-D08D-4F92-A9C9-92B30B9B3FF1, can be used to extract content and execute a Databricks Notebook once the raw content is stored in the storage account. This template can be used with the Data Labs Data Product template.

If the notebook execution succeeds, the status of the Assets involved is updated to 2; otherwise, it is set to 0.

This pipeline template performs the following steps:

  • Get the list of Entities to export from the Data Lake.
  • For the Entities to copy to storage, an API call to the /query Sidra Core API endpoint copies the data from the DSU to the client raw storage.
  • An entry point notebook, or orchestration notebook, whose name is passed as a parameter to the pipeline, is executed (a sketch of such a notebook is shown after this list). Here the user can add any custom exploratory data analysis on the data in the client raw storage.
  • Finally, the Asset status is changed to ImportedFromDataStorageUnit (status 3), meaning that the Asset has been imported from the Data Storage Unit into the client database. Otherwise, the status will be 0 (error).
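
As a sketch of what the entry point (orchestration) notebook might do, the example below chains two downstream notebooks using Databricks notebook workflows; the downstream notebook paths and the entity name are hypothetical.

# Orchestration (entry point) notebook sketch: chain further notebooks with
# dbutils.notebook.run. The notebook paths below are hypothetical examples.

# Run a cleansing notebook first, passing the entity to process (10-minute timeout).
cleansing_result = dbutils.notebook.run("/Shared/EDA/CleanseEntity", 600, {"entityName": "myentity"})

# Then run an analysis notebook on the cleansed output (20-minute timeout).
analysis_result = dbutils.notebook.run("/Shared/EDA/AnalyseEntity", 1200, {"entityName": "myentity"})

print(cleansing_result, analysis_result)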

The parameters required for using this pipeline are:

  • orchestratorNotebookPath: The path of the Notebook to execute. The Notebook should be previously created and uploaded into the Databricks instance.
  • entityName: A parameter to provide the name of the Entity (or empty).
  • pipelineName: A parameter to provide the name of the pipeline which is being executed (or empty).

Both entityName and pipelineName are provided to the Notebook; depending on the Notebook, they may or may not be used.
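
Inside the Notebook, these parameters are typically read as Databricks widgets. The following is a minimal sketch and assumes the pipeline passes the parameters as widget values.

# Read the parameters passed by the pipeline as notebook widgets. Declaring the
# widgets with empty defaults also allows the notebook to be run manually.
dbutils.widgets.text("entityName", "")
dbutils.widgets.text("pipelineName", "")

entity_name = dbutils.widgets.get("entityName")
pipeline_name = dbutils.widgets.get("pipelineName")

print(f"Processing Entity '{entity_name}' triggered by pipeline '{pipeline_name}'")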

For example, the ExecutionParameters section could be:

{
    "orchestratorNotebookPath": "/Shared/MyNotebook",
    "entityName": "myentity",
    "pipelineName": "mypipeline"
}

Section Data Product pipelines includes information on the available Data Product pipelines to be used for this Data Product.


Last update: 2023-06-22