Table creation script¶
The process of ingesting a file into Databricks is performed by an Azure Data Factory (ADF) pipeline that generates and executes two scripts:
- The table creation script creates the necessary database and tables in the Databricks cluster.
- The DSU ingestion notebook reads the raw copy of the Asset and inserts the information in the tables created by the previous script.
The table creation script is a Spark SQL script generated specifically for each Entity. When executed, it creates the following structures in the Databricks cluster:
- A database for the Provider.
- A table for the Entity.
That means that all the Entities associated with the same Provider share the same database, but each Entity has its own table.
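As a minimal sketch of that layout (the provider and entity names below are illustrative, not taken from a real configuration):

```python
# Hypothetical metadata: one Provider with two Entities.
provider = {"DatabaseName": "sales"}
entities = [{"TableName": "employee"}, {"TableName": "customer"}]

# All Entities of the same Provider share one database;
# each Entity gets its own table inside it.
tables = [f"{provider['DatabaseName']}.{e['TableName']}" for e in entities]
print(tables)  # ['sales.employee', 'sales.customer']
```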
The table creation script in the SDK (datalake.py) is called by the Python notebook CreateTables.py, which is stored in the DSU Ingestion folder in Databricks.
Provider database¶
The part of the Spark SQL query that creates the database uses the value in the DatabaseName column of the Provider as the database name.
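As a sketch, assuming the same IF NOT EXISTS pattern and templating as the table-creation pseudo-code below:

```sql
CREATE DATABASE IF NOT EXISTS {Provider.DatabaseName}
```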
Entity table¶
This is a sample of the part of the Spark SQL query that creates the table:
CREATE TABLE IF NOT EXISTS {Provider.DatabaseName}.{Entity.TableName} (
{Attribute1.Name} {Attribute1.HiveType},
{Attribute2.Name} {Attribute2.HiveType},
...)
USING {Entity.Format}
PARTITIONED BY (
{AttributeWithIsPartitionColumn1.Name},
{AttributeWithIsPartitionColumn2.Name},
...)
So, a table is created with the name configured in the TableName column of the Entity.

That table has a column for each Attribute associated with the Entity: the name of the column is the Name of the Attribute and its type is the HiveType of the Attribute. The storage format is the Format of the Entity; currently, Format can be ORC, Parquet or Delta.

Finally, each Attribute associated with the Entity that has IsPartitionColumn set to true becomes a partition column, identified by the Name of the Attribute.
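The rendering described above can be sketched in Python as follows. The dict shapes and the render_create_table helper are illustrative, not the actual SDK code in datalake.py:

```python
# A minimal sketch of how the table-creation statement could be rendered
# from the Core metadata; the real logic lives in the SDK (datalake.py).
def render_create_table(provider, entity, attributes):
    # One column per Attribute: Name plus HiveType.
    cols = ",\n".join(f"    {a['Name']} {a['HiveType']}" for a in attributes)
    # Attributes flagged with IsPartitionColumn become partition columns.
    partitions = [a["Name"] for a in attributes if a["IsPartitionColumn"]]
    sql = (
        f"CREATE TABLE IF NOT EXISTS "
        f"{provider['DatabaseName']}.{entity['TableName']} (\n{cols})\n"
        f"USING {entity['Format']}"
    )
    if partitions:
        sql += "\nPARTITIONED BY (\n    " + ",\n    ".join(partitions) + ")"
    return sql

provider = {"DatabaseName": "sales"}
entity = {"TableName": "employee", "Format": "Parquet"}
attributes = [
    {"Name": "Name", "HiveType": "STRING", "IsPartitionColumn": 0},
    {"Name": "Age", "HiveType": "INT", "IsPartitionColumn": 0},
    {"Name": "FileDate", "HiveType": "DATE", "IsPartitionColumn": 1},
]
sql = render_create_table(provider, entity, attributes)
print(sql)
```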
An Entity sample¶
The relationship between the information stored in the Core database and the data structures generated in Databricks can be seen in the following simplified sample:
Provider table¶
| Id | ProviderName | DatabaseName |
|----|--------------|--------------|
| 1 | Sales Department | sales |
Entity table¶
| Id | IdProvider | Name | TableName | TableFormat |
|----|------------|------|-----------|-------------|
| 10 | 1 | Employee | employee | Parquet |
Attribute table¶
| IdEntity | Name | HiveType | IsCalculated | IsPartitionColumn |
|----------|------|----------|--------------|-------------------|
| 10 | Name | STRING | 0 | 0 |
| 10 | Age | INT | 0 | 0 |
| 10 | FileDate | DATE | 1 | 1 |
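Putting these sample rows through the templates above, the generated script for this Entity would look along these lines (a sketch; the exact formatting and statement separators may differ):

```sql
CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE IF NOT EXISTS sales.employee (
    Name STRING,
    Age INT,
    FileDate DATE)
USING Parquet
PARTITIONED BY (FileDate)
```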