Data Labs Client Application¶
To enable exploratory data analysis and experimentation with models before they are productionalized, Sidra incorporates a Client Application template called Data Labs, whose main component is a dedicated Databricks cluster.
This document refers to key concepts of a Client Application in Sidra, which can be reviewed here.
The main purpose of the Data Labs Client Application is to provide a sandbox environment for data analytics and exploratory data analysis use cases. The Data Labs deployment includes a dedicated Databricks cluster, separate from the data intake cluster in Core. This ensures a separation of concerns between the data intake cluster and the Client Application, so data intake performance is not affected. Compliance requirements can also be fulfilled thanks to the ability to define different locations for the clusters. In addition, this separation allows different cluster specifications to be defined: while the DSU data intake cluster is an engineering-type cluster, the Data Labs cluster can be tailored to the needs of the business case (e.g. GPU-enabled).
The Databricks cluster in Data Labs can be used for a wide variety of scenarios through Databricks notebooks, such as:
- Exploratory data analysis, including the visualization capabilities of Databricks notebooks.
- Machine Learning model experiments and training.
- Advanced and multiple paths of analytics and processing, through the orchestration and chaining of logic across several notebooks.
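As an illustration of the first scenario, a minimal exploratory analysis inside a Data Labs notebook might look like the following sketch. The dataset and column names are hypothetical; in a real notebook the data would be read from the Client Application's raw storage:

```python
import pandas as pd

# Hypothetical sample standing in for data copied from the DSU raw storage.
orders = pd.DataFrame({
    "customer": ["acme", "acme", "globex", "initech"],
    "amount":   [120.0, 80.0, 310.0, 50.0],
})

# Typical first-pass exploration: summary statistics and an aggregation.
summary = orders["amount"].describe()
per_customer = orders.groupby("customer")["amount"].sum().sort_values(ascending=False)

print(per_customer)
```

From here, the notebook's built-in visualization capabilities can be used to chart `per_customer` directly.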
As long as Data Labs is configured with the required permissions to access the DSU data, this application retrieves data from the DSU transparently and automatically.
Data Labs, like any Client Application integrated with Sidra, shares the common security model with Sidra Core and uses Identity Server for authentication. A copy of the relevant ingested Assets metadata is kept synchronized with Sidra Core at all times. This metadata synchronization is performed by an automated Sync job, explained here.
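Conceptually, the Sync job upserts Asset metadata coming from Sidra Core into the client database. The following sketch uses sqlite3 as a stand-in for the Client Database, and the table and column names are illustrative, not Sidra's actual metadata schema:

```python
import sqlite3

# sqlite3 stands in for the Client Database; the schema is illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Assets (Id INTEGER PRIMARY KEY, Name TEXT, Status INTEGER)")

# Metadata arriving from Sidra Core (e.g. newly ingested Assets, status 1).
core_assets = [(1, "sales_2024.parquet", 1), (2, "customers.parquet", 1)]

# A sync pass conceptually upserts the Core metadata into the local copy,
# so repeated runs converge on the same state.
conn.executemany(
    "INSERT INTO Assets (Id, Name, Status) VALUES (?, ?, ?) "
    "ON CONFLICT(Id) DO UPDATE SET Name = excluded.Name, Status = excluded.Status",
    core_assets,
)

pending = conn.execute("SELECT Name FROM Assets WHERE Status = 1").fetchall()
print(pending)
```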
The actual data flow orchestration is performed by a dedicated instance of Azure Data Factory installed in the Client Application. Through the orchestration pipelines, the content of the Delta tables in the DSU Databricks is copied directly to the Client Application.
High-level installation details¶
As with any other type of Client Application in Sidra, the process of installing this Client Application consists of the following main steps:
- A dotnet template is installed, which launches a build and release pipeline defined in Azure DevOps for the Data Labs Client Application.
- As part of the build and release pipeline for this Client Application, the required infrastructure is deployed. This includes the execution of the ClientDeploy.ps1 and Databricks deployment scripts, as well as the deployment of the different WebJobs.
Build and release are performed with multi-stage pipelines, so by default no manual intervention is required once the template is installed.
For more information on these topics you can access the following Documentation.
The Data Labs Client Application resources are contained in a single resource group, separate from the Sidra Core and DSU resource groups. The ARM template for this Client Application includes the following services:
- Storage account: used for storing the copy of the raw data that is extracted from the DSU and to which the Client Application has access.
- Databricks cluster: used for computing and processing the business logic required on the data copied from the DSU. One or several orchestrator notebooks can be created in order to process the data from the storage.
- Data Factory: used for data orchestration pipelines to bring the data from the DSU to Data Labs.
- Client Database: used for keeping a synchronized copy of the Assets metadata between Sidra Core and Data Labs.
- Key Vault: used for storing and accessing secrets in a secure way.
In addition to the Azure infrastructure, several WebJobs are also deployed for Data Labs; these are responsible for the background tasks of data and metadata synchronization.
Client Application pipelines¶
The pipeline template for a Notebook execution, with ItemId F5170307-D08D-4F92-A9C9-92B30B9B3FF1, can be used to extract content and execute a Databricks Notebook once the raw content is stored in the storage account. It can be used with the Data Labs Client Application template.
If the notebook execution succeeds, the status of the Assets involved is updated to 2; otherwise it is set to 0. This pipeline template performs the following steps:
- Get the list of Entities to export from the Data Lake.
- For the Entities to copy to storage, a call to the /query Sidra Core API endpoint copies the data from the DSU to the client storage (raw).
- An entry point notebook, or orchestration notebook, whose name is passed as a parameter to the pipeline, is executed. Here the user will be able to add any custom exploratory data analysis on the data that is in the client raw storage.
- Finally, the Asset status is changed to ImportedFromDataStorageUnit (status 3), meaning that the Asset has been imported from the Data Storage Unit into the client database. Otherwise, the status will be 0 (error).
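The API call in the second step can be pictured as follows. The host name, path, and query parameter are assumptions for illustration only; the actual contract of the /query endpoint is defined by the Sidra Core API:

```python
from urllib.parse import urlencode, urlunsplit

def build_query_url(api_host: str, entity_name: str) -> str:
    """Build an illustrative URL for the Sidra Core /query endpoint.

    The path and parameter names are hypothetical; consult the Sidra
    Core API reference for the actual contract.
    """
    query = urlencode({"entityName": entity_name})
    return urlunsplit(("https", api_host, "/api/query", query, ""))

# Hypothetical host and Entity name, for illustration only.
url = build_query_url("core.example.com", "Customers")
print(url)  # https://core.example.com/api/query?entityName=Customers
```

In the deployed pipeline, Data Factory issues one such call per Entity selected for export, authenticating against Identity Server.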
The parameters required for using this pipeline are:
orchestratorNotebookPath: The path of the Notebook to execute. The Notebook should be previously created and uploaded into the Databricks instance.
entityName: A parameter to provide the name of the Entity (or empty).
pipelineName: A parameter to provide the name of the pipeline which is being executed (or empty).
entityName and pipelineName are parameters that are provided to the Notebook; depending on the Notebook, they may or may not be used.
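The three parameters above can be sketched as a simple validation step. In an actual orchestrator notebook they would arrive as Databricks widget values; here a plain function (hypothetical, not part of Sidra) keeps the sketch self-contained:

```python
def parse_pipeline_parameters(params: dict) -> dict:
    """Validate the pipeline parameters described above.

    orchestratorNotebookPath is mandatory; entityName and pipelineName
    may be empty, since not every Notebook uses them.
    """
    notebook_path = params.get("orchestratorNotebookPath", "")
    if not notebook_path:
        raise ValueError("orchestratorNotebookPath is required")
    return {
        "orchestratorNotebookPath": notebook_path,
        "entityName": params.get("entityName", ""),
        "pipelineName": params.get("pipelineName", ""),
    }

# Example call with a hypothetical notebook path and Entity name.
parsed = parse_pipeline_parameters({
    "orchestratorNotebookPath": "/Shared/orchestrator",
    "entityName": "Customers",
})
print(parsed["pipelineName"])  # empty string: the parameter is optional
```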
For example, the ExecutionParameters section will be:
Section Client Application pipelines includes information on the available Client Application pipelines to be used for this Client Application.