Databricks cluster

A detailed introduction to Databricks is outside the scope of this document, but the key concepts needed to understand the rest of the Sidra platform documentation are summarized here. Additional information can be found on the official Databricks documentation website.

What is Azure Databricks

Azure Databricks is an Apache Spark analytics platform optimized for Azure. It integrates with other Azure services such as SQL Data Warehouse, Power BI, Azure Active Directory and Azure Storage.

Like any other Azure resource, it can be created from the Azure Portal or through Azure Resource Manager (ARM) using ARM templates.
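
As an illustration of the ARM template route, the following minimal sketch builds a template for a Databricks workspace as a Python dictionary and prints it as JSON. The workspace name, location, SKU and the "-managed" resource group suffix are hypothetical values chosen for the example, not the ones Sidra uses, and the apiVersion shown should be checked against the official template reference before deploying.

```python
import json


def build_workspace_template(workspace_name: str, location: str, sku: str) -> dict:
    """Return a minimal ARM template declaring one Azure Databricks workspace."""
    # Assumption: the managed resource group is named after the workspace.
    managed_group = workspace_name + "-managed"
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.Databricks/workspaces",
                "apiVersion": "2018-04-01",
                "name": workspace_name,
                "location": location,
                "sku": {"name": sku},
                "properties": {
                    # ARM template expression, evaluated by Azure at deployment time.
                    "managedResourceGroupId": (
                        "[subscriptionResourceId('Microsoft.Resources/resourceGroups', '"
                        + managed_group
                        + "')]"
                    )
                },
            }
        ],
    }


# Hypothetical values, for illustration only.
template = build_workspace_template("example-databricks", "westeurope", "standard")
print(json.dumps(template, indent=2))
```

The resulting JSON can be saved to a file and deployed with the usual ARM tooling.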

Workspace

A workspace is an environment for accessing all the user's Azure Databricks objects. The workspace organizes those objects into folders. Additionally, there are two special folders:

  • Shared is for sharing objects across the organization
  • Users contains a folder for each registered user

The Azure Databricks UI, a web portal, can be launched from the Azure Portal with the user's workspace selected. Thanks to the integration with Azure Active Directory, the same user account that is signed in to the Azure Portal is used in the Azure Databricks UI.

[Image: Azure Databricks UI]

The objects that can be contained in a folder are Dashboards, Libraries, Experiments, Notebooks and Databricks File System (DBFS). The last two are the most relevant for Sidra:

  • A Notebook is a web-based interface to documents that contain runnable commands, visualizations, and narrative text. When Sidra needs to run a script in Databricks, it uses a Notebook that references the script.

  • Databricks File System (DBFS) is a filesystem abstraction layer over a blob store. The Sidra platform uses DBFS to store the script that will be executed by a Notebook.
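
To make the idea of storing a script in DBFS more concrete, the sketch below builds the request body for the DBFS REST API `put` operation (POST /api/2.0/dbfs/put), which takes the target path and the file contents encoded in base64. It only builds the payload and does not call any workspace. The path and the script text are hypothetical, and this is not necessarily how Sidra uploads its scripts.

```python
import base64
import json


def build_dbfs_put_request(dbfs_path: str, script_text: str) -> dict:
    """Build the JSON body for a DBFS `put` call (POST /api/2.0/dbfs/put)."""
    if not dbfs_path.startswith("/"):
        raise ValueError("DBFS paths are absolute and start with '/'")
    # The API expects the file contents as a base64-encoded string.
    encoded = base64.b64encode(script_text.encode("utf-8")).decode("ascii")
    return {"path": dbfs_path, "contents": encoded, "overwrite": True}


# Hypothetical path and script, for illustration only.
request_body = build_dbfs_put_request(
    "/sidra/scripts/example_transfer.py",
    "print('hello from DBFS')\n",
)
print(json.dumps(request_body, indent=2))
```

A notebook could then reference the uploaded file by its DBFS path.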

Data Management

The objects that hold the data on which analytics are performed are Database, Table, Partition and Metastore. Sidra creates the appropriate database, tables and partitions for the data, based on the metadata about that data stored in Core, using two scripts:

  • CreateTableScript creates the database, tables and partitions.

  • TransferQuery inserts the data into the objects created by CreateTableScript.
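
The division of work between the two scripts can be sketched as follows. This is not the code that Sidra generates: the function names, the Parquet storage format, the staging view and the sample metadata are all assumptions made for the example. It only shows the kind of Spark SQL statements that a create step and a transfer step would typically produce from metadata.

```python
def create_table_statements(database, table, columns, partition_columns):
    """Return Spark SQL that creates the database and a partitioned table."""
    # With the data source syntax, partition columns are declared with the rest.
    all_columns = {**columns, **partition_columns}
    column_list = ", ".join(f"`{name}` {sql_type}" for name, sql_type in all_columns.items())
    partition_list = ", ".join(f"`{name}`" for name in partition_columns)
    return [
        f"CREATE DATABASE IF NOT EXISTS `{database}`",
        f"CREATE TABLE IF NOT EXISTS `{database}`.`{table}` ({column_list}) "
        f"USING PARQUET PARTITIONED BY ({partition_list})",
    ]


def transfer_statement(database, table, columns, partition_values, source_view):
    """Return Spark SQL that loads one partition from a staging view."""
    # Naive quoting: names and values are assumed to come from trusted metadata.
    select_list = ", ".join(f"`{name}`" for name in columns)
    partition_spec = ", ".join(
        f"`{name}` = '{value}'" for name, value in partition_values.items()
    )
    return (
        f"INSERT INTO `{database}`.`{table}` PARTITION ({partition_spec}) "
        f"SELECT {select_list} FROM `{source_view}`"
    )


# Hypothetical metadata, for illustration only.
columns = {"id": "INT", "name": "STRING"}
partitions = {"load_date": "STRING"}
statements = create_table_statements("sales", "customer", columns, partitions)
insert = transfer_statement(
    "sales", "customer", columns, {"load_date": "2024-01-31"}, "customer_staging"
)
for sql in statements + [insert]:
    print(sql)
```

Keeping the two steps separate means the table definition can be created once, while the insert is repeated for every new load.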

Computation Management

Clusters and Jobs are the concepts related to running analytic computations in Azure Databricks.

  • A Cluster is a set of computation resources and configurations on which notebooks and jobs run. The cluster is started automatically when a job is executed and can be configured to terminate after a period of inactivity, which reduces the cost of the resource.

  • A Job is a non-interactive mechanism for running a notebook. Since a notebook is a web-based UI, Sidra uses jobs to run notebooks.
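
The relationship between the two concepts can be illustrated with the request bodies accepted by the Databricks REST API 2.0: a cluster definition with an auto-termination period (clusters/create) and a job that runs a notebook on that cluster (jobs/create). The sketch below only builds the payloads and does not call any workspace. The cluster name, runtime version, node type, cluster id, notebook path and parameter are hypothetical values, not Sidra's actual configuration.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id, num_workers, idle_minutes):
    """Body for clusters/create: a cluster that stops itself when idle."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Terminate after this many minutes without activity to reduce cost.
        "autotermination_minutes": idle_minutes,
    }


def build_notebook_job(job_name, cluster_id, notebook_path, parameters):
    """Body for jobs/create: run a notebook non-interactively on a cluster."""
    return {
        "name": job_name,
        "existing_cluster_id": cluster_id,
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": parameters,
        },
    }


# Hypothetical values, for illustration only.
cluster_spec = build_cluster_spec(
    "example-cluster", "7.3.x-scala2.12", "Standard_DS3_v2", 2, 30
)
job_spec = build_notebook_job(
    "example-transfer-job",
    "0000-000000-example0",
    "/Shared/example_transfer",
    {"script_path": "dbfs:/sidra/scripts/example_transfer.py"},
)
print(json.dumps(cluster_spec, indent=2))
print(json.dumps(job_spec, indent=2))
```

With this arrangement the cluster only incurs cost while jobs are running, plus the configured idle period.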