Table creation script

The process for ingesting a file into Databricks is performed by an Azure Data Factory (ADF) pipeline that generates and executes two scripts:

  • The table creation script. It creates the necessary database and tables in the Databricks cluster.
  • The transfer query script. It reads the raw copy of the asset and inserts the information into the tables created by the previous script.

The table creation script is Spark SQL code generated specifically for each entity. When it is executed, it creates the following structures in the Databricks cluster:

  • A database for the provider
  • A table for the entity
  • A validation error table for the entity

That means that all the entities associated with the same provider share the same database, but each entity has its own table and validation error table.

The table creation script is generated and executed by the Python notebook CreateTables.py, which is stored in the Shared folder in Databricks.
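
As an illustration, a notebook following this pattern could receive the database and table names as parameters, build the Spark SQL script and run it statement by statement with spark.sql. The following is only a minimal sketch, not the actual CreateTables.py; the widget names are hypothetical and the script is hard-coded instead of being generated from the Core metadata.

# Minimal sketch of a table creation notebook (hypothetical, not the actual CreateTables.py).
# Assumes a Databricks notebook context, where 'spark' and 'dbutils' are predefined.

# Hypothetical widget names used to pass the target database and table.
database = dbutils.widgets.get("providerDatabaseName")   # e.g. "sales"
table = dbutils.widgets.get("entityTableName")            # e.g. "employee"

# In the real notebook the script is generated from the Core metadata;
# here a fixed template stands in for it.
script = f"""
CREATE DATABASE IF NOT EXISTS {database};
CREATE TABLE IF NOT EXISTS {database}.{table} (`Name` STRING, `Age` INT, `FileDate` DATE)
USING PARQUET
PARTITIONED BY (`FileDate`)
"""

# spark.sql executes one statement at a time, so split the script on ";".
for statement in (s.strip() for s in script.split(";") if s.strip()):
    spark.sql(statement)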

NOTE:

In previous versions of Sidra, the table creation script was generated by a custom activity named CreateTableScript, which stored it in the Databricks File System (DBFS); the script was then executed by a Notebook activity.

Provider database

The following pseudo-code shows the part of the Spark SQL script that creates the database:

CREATE DATABASE IF NOT EXISTS {Provider.DatabaseName};

So, it will use the value in the DatabaseName column of the Provider as the database name.
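
For instance, assuming the Provider metadata has been loaded into a Python dictionary, that statement could be built and executed as in the following sketch (illustrative names, not the actual generator code):

# Illustrative only: 'provider' stands in for a row of the Provider table.
provider = {"DatabaseName": "sales"}

create_database_sql = f"CREATE DATABASE IF NOT EXISTS {provider['DatabaseName']}"
spark.sql(create_database_sql)  # 'spark' is predefined in a Databricks notebook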

Entity table

This pseudo-code shows the part of the Spark SQL script that creates the table:

CREATE TABLE IF NOT EXISTS {Provider.DatabaseName}.{Entity.TableName} (
    {Attribute1.Name} {Attribute1.HiveType},
    {Attribute2.Name} {Attribute2.HiveType},
    ...)
USING {Entity.Format}
PARTITIONED BY (
    {AttributeWithIsPartitionColumn1.Name},
    {AttributeWithIsPartitionColumn2.Name},
    ...)

So, it will create a table with the name configured in the TableName column of the entity. That table has one column for each Attribute associated with the entity: the column name is the Name of the Attribute and its type is the HiveType of the Attribute. The table uses the Format of the Entity, which can currently be ORC, Parquet or Delta.

Finally, for each Attribute associated with the entity that has IsPartitionColumn set to true, a partition column is generated using the Name of the Attribute.
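
Putting these rules together, a simplified generator could look like the following sketch. The attribute dictionaries mirror a few columns of the Attribute table; the function and field names are illustrative assumptions, not the actual Sidra code.

# Hypothetical sketch: build the entity table DDL from attribute metadata.
attributes = [
    {"Name": "Name",     "HiveType": "STRING", "IsCalculated": False, "IsPartitionColumn": False},
    {"Name": "Age",      "HiveType": "INT",    "IsCalculated": False, "IsPartitionColumn": False},
    {"Name": "FileDate", "HiveType": "DATE",   "IsCalculated": True,  "IsPartitionColumn": True},
]

def build_entity_table_sql(database, table, table_format, attributes):
    # One column per Attribute, using its Name and HiveType.
    columns = ",\n".join(f"    `{a['Name']}` {a['HiveType']}" for a in attributes)
    # One partition column per Attribute with IsPartitionColumn set to true.
    partitions = ", ".join(f"`{a['Name']}`" for a in attributes if a["IsPartitionColumn"])
    sql = (f"CREATE TABLE IF NOT EXISTS {database}.{table} (\n{columns})\n"
           f"USING {table_format}")
    if partitions:
        sql += f"\nPARTITIONED BY ({partitions})"
    return sql

spark.sql(build_entity_table_sql("sales", "employee", "PARQUET", attributes))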

Entity validation error table

The validation error table is quite similar to the entity table, except for the following differences (illustrated in the sketch after this list):

  • The name of the table has the suffix validationerrors appended; e.g. if the Entity TableName is sample, the validation error table name will be samplevalidationerrors.
  • The type of the non-calculated columns is STRING instead of the HiveType of the Attribute. Non-calculated columns can be identified by having the IsCalculated column set to 0.
  • The validation error table contains an additional column, ViolatedConstraints, of the Hive type STRING. It helps to identify which constraints were violated during the load of the asset and to report the issue.
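
Following the same pattern, those differences could be expressed as a small variation of the previous sketch (again with illustrative names, not the actual Sidra code):

# Hypothetical sketch: derive the validation error table DDL from the same metadata.
attributes = [
    {"Name": "Name",     "HiveType": "STRING", "IsCalculated": False, "IsPartitionColumn": False},
    {"Name": "Age",      "HiveType": "INT",    "IsCalculated": False, "IsPartitionColumn": False},
    {"Name": "FileDate", "HiveType": "DATE",   "IsCalculated": True,  "IsPartitionColumn": True},
]

def build_validation_errors_sql(database, table, table_format, attributes):
    regular = [a for a in attributes if not a["IsPartitionColumn"]]
    partition = [a for a in attributes if a["IsPartitionColumn"]]

    columns = []
    for a in regular + partition:
        # Non-calculated columns (IsCalculated = 0) are typed as STRING.
        hive_type = a["HiveType"] if a["IsCalculated"] else "STRING"
        columns.append(f"    `{a['Name']}` {hive_type}")
    # Extra column that records which constraints each row violated,
    # placed before the partition columns.
    columns.insert(len(regular), "    `ViolatedConstraints` STRING")

    sql = (f"CREATE TABLE IF NOT EXISTS {database}.{table}validationerrors (\n"
           + ",\n".join(columns) + ")\n"
           + f"USING {table_format}")
    if partition:
        sql += "\nPARTITIONED BY (" + ", ".join(f"`{a['Name']}`" for a in partition) + ")"
    return sql

spark.sql(build_validation_errors_sql("sales", "employee", "PARQUET", attributes))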

An entity sample

The relation between the information stored in the Core database and the data structures generated in Databricks can be seen in the following simplified sample:

Provider table

Id   ProviderName       DatabaseName
1    Sales Department   sales

Entity table

Id   IdProvider   Name       TableName   TableFormat
10   1            Employee   employee    Parquet

Attribute table

IdEntity   Name       HiveType   IsCalculated   IsPartitionColumn
10         Name       STRING     0              0
10         Age        INT        0              0
10         FileDate   DATE       1              1

With this metadata, the table creation script generated for the Employee entity is the following:

CREATE DATABASE IF NOT EXISTS sales;

CREATE TABLE IF NOT EXISTS sales.employee (
    `Name` STRING,
    `Age` INT,
    `FileDate` DATE)
USING PARQUET
PARTITIONED BY (`FileDate`);

CREATE TABLE IF NOT EXISTS sales.employeevalidationerrors (
    `Name` STRING,
    `Age` STRING,
    `ViolatedConstraints` STRING,
    `FileDate` DATE)
USING PARQUET
PARTITIONED BY (`FileDate`);