
How Excel data intake works

The goal of this document is to describe a specific type of file ingestion from complex-formatted Excel files.

Important Note

Note that this option is only available for version 1.11 and higher.

Introduction

The process for Excel ingestion follows the same steps and involves the same concepts as outlined in Asset flow into the platform via landing. However, there is a set of specific configuration pre-steps required to configure the schema and structure of the Entities. Some Databricks Notebook artifacts have been created specifically to:

  • Infer the schema of the Excel sheets to generate the Schema, which further informs the way Entities and Views are going to be created, as well as the way the data is going to be extracted.
  • Validate this schema.
  • Generate the related Entities.
  • Ingest the data and create the Databricks tables and views.

The file ingestion process also has a specific implementation, through a separate transfer query script called ExcelDataIntake.py. This page outlines the complete pre-steps and the data ingestion steps, referencing the common concepts and adding details specific to this file ingestion.

Info

For more information, check the specific tutorial for Excel data intake

End-to-end ingestion process

This page starts from pre-step 3 in Asset flows into the platform via landing. We assume that there is an automated process in place to deposit the Excel files in the landing folder on a schedule.

The files need to be deposited in the landing zone following a set of conventions, such as a specific file naming convention, and in an agreed folder path that mirrors the structure of Providers/Entities (Asset metadata) in Sidra.

Once the files are in the landing zone, a file ingestion process is triggered every time there is a new drop of data in the landing zone.

The pipelines involved in file ingestion from the landing zone come preinstalled with any Sidra installation. These pipelines are the same as the ones described here.

The process would be the following:

  1. The pipeline RegisterAsset is triggered, which registers the Asset into the platform. This pipeline also copies the file from the landing zone to an Azure Storage container (raw container) specific to the Data Storage Unit (DSU).

  2. RegisterAsset calls another pipeline, which is the actual data ingestion pipeline. For file ingestion from the landing zone, the data ingestion pipeline used is FileIngestionDatabricks. For the file registration to succeed, the Asset metadata must have been created, and the created Entities must be associated with the FileIngestionDatabricks pipeline.

  3. The data ingestion pipeline FileIngestionDatabricks then ingests the data into the Data Storage Unit (DSU) as delta Databricks tables. The main difference in the case of Excel ingestion is that FileIngestionDatabricks does not execute the general transfer query script described in Transfer query, but a specific transfer query script called ExcelDataIntake.py. Details about this script are included in Sub-step 1.3. File ingestion. The data ingestion process reads the input Excel file and the additionalProperties assigned to the Entity to which the file belongs, generates an output JSON with the actual data structures found in the Excel, and writes that data into the Databricks tables.

  4. For each sheet, there will be a JSON collection in the output JSON file.

  5. The data ingestion process will create a number of Entities of type View, which represent the different sheets in the data. These views are related to the parent Entity with a ViewOf type of relationship (type = 4). The SQL query to generate these views is included in the Entity field ViewsDefinition. The Excel data intake process incorporates a script called ViewsDefinition.py for creating these Entities in the Core metadata database, as well as the views in Databricks.

    Info

    You can see a detailed description of this field and the EntityEntity relationships in the Assets metadata.

  6. After the ingestion is finished, a JSON is generated for each given Entity, so that it can be ingested in the data lake into the delta table for that Entity.

  7. Once the data ingestion pipeline has finished, the calling pipeline RegisterAsset deletes the copy of the file from the landing zone.

Pre-step 1. Configure the metadata (Entities, Attributes) for the data intake process

Note

This step is only required when configuring the data intake process for the first time, or whenever there is any change in the file structures (e.g. added columns).

In the case of Excel ingestion, there are additional artifacts created in order to configure the structure of the data inside the Excel documents and their respective Excel sheets, and how this data is going to be stored in Databricks. There are several principles and concepts to be considered here:

  1. For each Excel file there will be one main Entity, called parent Entity, which defines the actual schema of all the contained Excel sheets. This is the Entity that is processed for file ingestion.

  2. This parent Entity needs to have its additionalProperties field populated with a configuration JSON. This configuration JSON describes the schema for each sheet, namely:

    • The transfer query script to execute to ingest the data from this Excel file.
    • Which data tables or standalone values are contained inside each sheet, and their location inside the sheet. For example, in a given sheet, we may identify a standalone value and a table. There will therefore be two separate identifiers, one for the standalone value, and another for the table.
    • The schema and data types contained in these tables. For example, the schema of a specific identified table in a given sheet (column names and types). The details on the additionalProperties syntax and processing rules are included in the section below.

Before starting the configuration of the Excel data intake, the following steps need to be followed:

  1. Create the Provider, as any other Provider in Sidra. See tutorial pages Add new Provider, Add new Entity.

  2. Create the parent Entity and populate its additionalProperties field with the schema of the different data elements inside the Excel sheets. There is one parent Entity per Excel file, not per Excel sheet. Each Excel sheet is just one collection inside the output JSON after processing the Excel file. To populate the additionalProperties field, there is a script called SchemaInference.py to aid in extracting the schema of the Excel file and in validating a given schema. See the section below for more details about this script.

  3. Associate the parent Entity with the data ingestion pipeline responsible for ingesting the data (FileIngestionDatabricks). See page Associate Entity to pipeline for more details.

  4. Create the associated views as related Entities. For creating these views, there is a script called ViewsDefinition.py. See the section below for more details about this script.

SchemaInference script

In order to extract the schema for the parent Entity and populate its additionalProperties field, a SchemaInference script is used in Databricks. This script contains different methods to either infer the schema of the Excel file or validate a schema given by the user, depending on the user's needs.

For the first case, below is a usage example to infer the schema of a given Excel portion with this script. The file can be anywhere; in this example it is taken from the Provider's container:

dataFileDir = "https://MYSTORAGEACCOUNT.blob.core.windows.net/MYPROVIDER/MYENTITY/2021/11/19/MYFILE.xlsx"
dataSelection = "'Summary'!A18:E48" #30 rows for test
df = infer_schema(dataFileDir, dataSelection)
display(df)

The output of this will be the following:

  • The actual Excel table read from the specified position in the Excel sheet.
  • A table with the actual schema, e.g.

[Image: Excel intake infer schema]

The resulting inferred schema can be used as an input to compose the additionalProperties JSON field.
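
For instance, a minimal sketch of that composition, assuming the inferred name/type pairs have been collected into a Python list (the exact output format of infer_schema may differ):

import json

# Hypothetical: the inferred schema collected as (column name, type) pairs.
# The exact structure returned by infer_schema may differ.
inferred = [("Ref", "double"), ("Code", "string"), ("Name", "string"),
            ("Accural", "double"), ("Cost", "double")]

# Compose the "schema" block of an entitySchema element.
schema_block = {name: datatype for name, datatype in inferred}
print(json.dumps(schema_block, indent=4))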

For the second case, below is a usage example of the method responsible for validating a given schema against an Excel portion:

dataFileDir = "https://MYSTORAGEACCOUNT.blob.core.windows.net/MYPROVIDER/MYENTITY/2021/11/19/MYFILE.xlsx"
dataSelection = "'Summary'!A19:E48" #Note that we start in the first data row now, as opposed to the header row 
schema = get_schema({
                "Ref": "double",
                "Code": "string",
                "Name": "string",
                "Accural": "double",
                "Cost": "double"
            }.items())

df = validate_schema(dataFileDir, dataSelection, schema)
display(df)

ViewsDefinition script

This script is responsible for inserting all the views that result from the configuration of a parent Entity. This means adding entries for the new Entities in the Sidra Core metadata database, as well as adding the Attributes for those Entities (also in the Sidra Core metadata database).

This script automatically generates the inserts for the Entity and Attribute tables for these views.

Below is a usage example to generate the views for a given parent Entity:

import json

parentEntityName = 'MyEntity'
parentEntityTable = 'MyTable'
parentEntityItemId = '63F8BD10-6A4C-45C7-B36D-021181F229A4' #ItemId on Entity table
providerDatabaseName = 'myproviderdb'
storageName = 'mystorage'

# additionalproperties must hold the configuration JSON of the parent Entity (see note below)
apjson = json.loads(additionalproperties)

result = generate_view_definition_and_attributes(parentEntityName, parentEntityItemId, parentEntityTable, providerDatabaseName, storageName)
print(result)

with open("/dbfs/mnt/views.txt", "w") as f:
  f.write(result)

The additionalproperties field needs to be populated as a variable before running the above code snippet.

The output of this is a snippet of SQL code to insert the data in these database tables:

  • Insert Attributes into the Attribute table.
  • Insert Entities into the Entity table, one per view.
  • For each of these Entities, populate the field ViewDefinition with the SQL query that generates the view's data.

Description of the additionalProperties structure

The Excel data intake process makes use of the additionalProperties field inside the Entity to specify how to extract the data from the Excel file.

This field has a specific JSON structure. Below is a sample of this JSON and a description of each field:

        {
            "identifier": "Summary.Summary_contract",
            "type": "value",
            "viewname": "Summary_table",
            "cell": "A2",
            "datatype": "STRING"
        },
        {
            "identifier": "Summary.Summary_description",
            "type": "value",
            "cell": "C2",
            "datatype": "STRING"
        },
        {
            "identifier": "Summary.Summary_table",
            "type": "table",
            "range": "A9:P9",
            "includeHeader": "False",
            "maxConsecutiveEmptyRows": 1,
            "schema": {
                "Tab Ref. Cost Code": "string",
                "Name": "string",
                "Trade": "string",
                "Liability": "double",
                "Previous": "double",
                "Difference": "double",
                "Report Cost": "double",
                "Subcon Applied": "double",
                "Subcon Liability": "double",
                "Subcon Accrual": "double",
                "_c10": "string",
                "_c11": "string",
                "_c12": "string",
                "Tender Cost Plan": "double",
                "Check": "double"
            }
        }

TransferQuery

The TransferQuery field represents the name of the transfer query script used to ingest the data, e.g., ExcelDataIntake.

entitySchema

The entitySchema array contains the schema that the parent Entity defines.
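
As a reference only (field casing and any additional optional fields may vary between Sidra versions), a complete additionalProperties value combining the TransferQuery field and the entitySchema array could look like this:

{
    "TransferQuery": "ExcelDataIntake",
    "entitySchema": [
        {
            "identifier": "Summary.Summary_contract",
            "type": "value",
            "viewname": "Summary_table",
            "cell": "A2",
            "datatype": "STRING"
        },
        {
            "identifier": "Summary.Summary_table",
            "type": "table",
            "range": "A9:P9",
            "includeHeader": "False",
            "maxConsecutiveEmptyRows": 1,
            "schema": {
                "Tab Ref. Cost Code": "string",
                "Name": "string",
                "Trade": "string"
            }
        }
    ]
}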

identifier

The identifier field includes the identifier of a data piece found in an Excel sheet. For example:

      "identifier": "Summary.Summary_description",
      "type": "value",
      "cell": "C2",
      "datatype": "STRING"

where:

  • The first element is the actual name of the Excel sheet, in this example Summary.
  • The second element is the name that will be given to the generated view in Sidra, in this example Summary_description.

Alternatively, we can find an identifiers element, which is a list of identifier fields, e.g.

      "identifiers":  [
      "1 Wages.Wages_table",  
      "2 Labour.Labour_table",
      "3 Materials.Materials_table"
      ]

In this case, we define a set of sheets inside the Excel file that share the same schema.

Example

"entitySchema"
"identifiers":  [
    "1 Wages.Wages_table",  
    "2 Labour.Labour_table",
    "3 Materials.Materials_table"
    ]
"type": "table",
"range":"A7:U7",
"schema": {
                "Tab Ref. Cost Code": "string",
                "Name": "string",
                "Trade": "string",
                "Liability": "double",
                "Previous": "double",
                "Difference": "double",
            ....
}
}

In the case above, we have three sheets that contain tables with exactly the same structure (and therefore share the same configuration): the table element starts at exactly the same position and the schema is the same in each sheet. So, the identifiers list contains all the sheets whose configuration is the same and, after that, the shared configuration appears just once (a table configuration in this case).

If the identifiers array contains only one element, it is considered equivalent to having the above identifier field.

Regular expressions can also be used inside these identifier elements. This may be useful when the Excel file contains an indeterminate number of sheets, all sharing the same structure and the same configuration. All the sheets whose name matches the regular expression will follow the same processing rules defined below the identifiers element.

This will generate N collections in the output JSON file.
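
As a minimal sketch of this expansion (assuming openpyxl, a hypothetical file path and a hypothetical pattern; this is not the actual Sidra implementation):

import re
from openpyxl import load_workbook

# Hypothetical pattern: any sheet named like "1 Wages", "2 Labour", "3 Materials", ...
pattern = re.compile(r"\d+ .+")

workbook = load_workbook("/dbfs/mnt/landing/MYFILE.xlsx", read_only=True)

# Every sheet whose name matches the pattern is processed with the same
# configuration defined below the identifiers element.
matching_sheets = [name for name in workbook.sheetnames if pattern.fullmatch(name)]
print(matching_sheets)  # e.g. ['1 Wages', '2 Labour', '3 Materials']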

type

Three types are possible:

  • value
  • table
  • range

If type = value, the data will be just in one cell. The field cell will contain the cell location inside the Excel Sheet, and the field datatype will contain the type of data inside this cell. For example:

{
      "identifier": "Summary.Summary_description",
      "type": "value",
      "cell: "C2",
      "datatype": "STRING"
}
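
For illustration only (not part of the Sidra scripts), resolving a value element amounts to reading a single cell and casting it to the declared datatype. A minimal sketch with openpyxl and a hypothetical file path:

from openpyxl import load_workbook

# Hypothetical file path; the sheet and cell come from the identifier and cell fields.
workbook = load_workbook("/dbfs/mnt/landing/MYFILE.xlsx", read_only=True, data_only=True)
sheet = workbook["Summary"]

# "cell": "C2", "datatype": "STRING"
raw_value = sheet["C2"].value
value = str(raw_value) if raw_value is not None else None
print(value)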

If type = table, there needs to be additional configuration data that describes both the range of cells where the table starts and the schema of the data table:

    {
        "identifier": "Summary.Summary_table",
        "type": "table",
        "range": "A9:P9",
        "includeHeader": "False",
        "maxConsecutiveEmptyRows": 1,
        "schema": {
            "Tab Ref. Cost Code": "string",
            "Name": "string",
            "Trade": "string",
            "Liability": "double",
            "Previous": "double",
            "Difference": "double",
            "Report Cost": "double",
            "Subcon Applied": "double",
            "Subcon Liability": "double",
            "Subcon Accrual": "double",
            "_c10": "string",
            "_c11": "string",
            "_c12": "string",
            "Tender Cost Plan": "double",
            "Check": "double"
        }
    }

where:

  • range: represents the range of cells that contains the first row of the table, e.g., A9:P9. The process will read from this row down towards the bottom of the sheet, until there is an empty row. A table can therefore have a variable number of rows, and there can be different tables one after another inside the same sheet, as long as some empty rows separate them.
  • includeHeader: represents whether the range includes a header or not for that table.
  • maxConsecutiveEmptyRows: represents the maximum number of consecutive empty rows that the range of cells can contain and still be considered the same table. If the number of consecutive empty rows is bigger than this threshold, any rows below are not considered part of the same table (see the sketch after this list).
  • schema: represents the list of columns contained in this table, as well as their field types.
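
The following is a simplified sketch of how a table element can be read according to these rules, using openpyxl and a hypothetical file path (this is not the actual ExcelDataIntake.py implementation):

from openpyxl import load_workbook
from openpyxl.utils import range_boundaries

workbook = load_workbook("/dbfs/mnt/landing/MYFILE.xlsx", read_only=True, data_only=True)
sheet = workbook["Summary"]

# "range": "A9:P9" -> first row of the table and the columns it spans.
min_col, min_row, max_col, _ = range_boundaries("A9:P9")
max_consecutive_empty_rows = 1

rows, empty_streak = [], 0
for row in sheet.iter_rows(min_row=min_row, min_col=min_col, max_col=max_col, values_only=True):
    if all(cell is None for cell in row):
        empty_streak += 1
        if empty_streak > max_consecutive_empty_rows:
            break  # rows below are no longer part of this table
        continue  # empty rows within the allowed streak are skipped in this sketch
    empty_streak = 0
    rows.append(row)

print(f"Read {len(rows)} data rows")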

If type = range, both the columns and the rows are specified, so the range of cells constituting the table is fixed.

schema

The schema field includes the list with the names of the columns and their types, in consecutive column order as they appear in the Excel sheet.

The names of the columns, e.g. Name, Trade, do not need to match the actual column names in the Excel file; there is an implicit mapping, as the columns are ordered and consecutive.

In the JSON generated by the TransferQuery script, a table is converted into an array. The array contains complex values with properties, these properties being, e.g., Name, Liability, etc., as per the sample above.

In the above example you can see values such as _c10, _c11 or _c12. These are just placeholders to specify that the corresponding column inside the table contains empty values (an empty column in the Excel).
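
As an illustration of this implicit positional mapping (a sketch assuming pandas, a hypothetical file path and a truncated schema of five columns):

import pandas as pd

# Names and types as declared in the schema element, in sheet column order (truncated here).
schema = {"Tab Ref. Cost Code": "string", "Name": "string", "Trade": "string",
          "Liability": "double", "Previous": "double"}

# Hypothetical read: header=None because the configured names replace whatever header
# the sheet has; skiprows/usecols would be derived from the range element (A9:E...).
df = pd.read_excel("/dbfs/mnt/landing/MYFILE.xlsx", sheet_name="Summary",
                   header=None, skiprows=8, usecols="A:E")

# Positional mapping: the n-th configured name is applied to the n-th column read.
df.columns = list(schema.keys())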

viewname

This value represents the name of the view that will include the table or value. Several fields and tables can be included inside the same view. All the values below the first value where the "viewname" is populated are considered part of the same view.

Example

        {
            "identifier": "Summary.Summary_contract",
            "type": "value",
            "viewname": "Summary_table",
            "cell": "A2",
            "datatype": "STRING"
        },
        {
            "identifier": "Summary.Summary_description",
            "type": "value",
            "cell": "C2",
            "datatype": "STRING"
        },
        {
            "identifier": "Summary.Summary_table",
            "type": "table",
            "range": "A9:P9",
            "includeHeader": "False",
            "maxConsecutiveEmptyRows": 1,
            "schema": {
                "Tab Ref. Cost Code": "string",
                "Name": "string",
                "Trade": "string",
                "Liability": "double",
                "Previous": "double",
                "Difference": "double",
                "Report Cost": "double",
                "Subcon Applied": "double",
                "Subcon Liability": "double",
                "Subcon Accrual": "double",
                "_c10": "string",
                "_c11": "string",
                "_c12": "string",
                "Tender Cost Plan": "double",
                "Check": "double"
            }
        }

All three data identifiers, Summary_contract, Summary_description and Summary_table, are considered part of the same view, Summary_table. So, after an element where viewname is populated, all subsequent configuration values go to this same view, until a different viewname is used.
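
A minimal sketch of this grouping rule (illustrative only, not the actual Sidra code):

def group_by_viewname(entity_schema):
    """Group entitySchema elements into views following the viewname rule."""
    views, current = {}, None
    for element in entity_schema:
        # An explicit viewname starts (or switches to) a view; elements
        # without one belong to the most recently named view.
        current = element.get("viewname", current)
        views.setdefault(current, []).append(element)
    return views

# With the configuration above, all three elements end up in "Summary_table".
entity_schema = [
    {"identifier": "Summary.Summary_contract", "type": "value", "viewname": "Summary_table", "cell": "A2", "datatype": "STRING"},
    {"identifier": "Summary.Summary_description", "type": "value", "cell": "C2", "datatype": "STRING"},
    {"identifier": "Summary.Summary_table", "type": "table", "range": "A9:P9"},
]
print(list(group_by_viewname(entity_schema).keys()))  # ['Summary_table']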

Step 1. Launch the ingestion process from the Landing zone

To summarize, these are general sub-steps performed as part of the file ingestion process from landing zone in Sidra Core:

  • Sub-step 1.1: File registration.
  • Sub-step 1.2: Storage of the raw copy of the file in the DSU.
  • Sub-step 1.3: File ingestion.

Sub-step 1.1 File registration

The file registration is the process of creating an Asset in the platform representing the file to be ingested in the DSU.
The result of the registration is the population of the correct data in the Sidra Core intake metadata and control tables. This process is the same as in Asset flows into the platform via landing.

Sub-step 1.2 Storage of the raw copy of the file in the DSU

This process is the same as in Asset flows into the platform via landing.

Sub-step 1.3 File ingestion

The file ingestion is the process that reads the raw copy of the file and ingests the information into the DSU in an optimized format.

In this case, ExcelDataIntake.py is the script for writing the generated output JSON into the tables in Databricks.
This specific transfer query is passed on as a parameter to the FileIngestionDatabricks pipeline.

These are the steps performed inside this script:

  • Read the input Excel file.
  • Generate the output JSON containing the processed data.
  • Load the processed data in delta tables in Databricks.
  • Generate the data preview tables in the Sidra Core DB (so that the preview is available from the Data Catalog pages).
  • Generate the Entity views.
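
Conceptually, and greatly simplified (this is not the actual ExcelDataIntake.py: the file path and table naming are hypothetical, the additionalProperties-driven extraction is omitted, and spark refers to the session available in a Databricks notebook), the first three steps map to operations like these:

import json
import pandas as pd

# 1. Read the input Excel file (here, naively, one DataFrame per sheet).
sheets = pd.read_excel("/dbfs/mnt/landing/MYFILE.xlsx", sheet_name=None)

# 2. Generate the output JSON containing the processed data:
#    one collection per sheet, keyed by sheet name.
output = {name: df.to_dict(orient="records") for name, df in sheets.items()}
output_json = json.dumps(output, default=str)

# 3. Load the processed data into delta tables in Databricks
#    (hypothetical database/table naming).
for name, df in sheets.items():
    spark.createDataFrame(df).write.format("delta") \
        .mode("overwrite").saveAsTable(f"myproviderdb.{name.lower().replace(' ', '_')}")

# Steps 4 and 5 (data preview tables and Entity views) rely on Sidra Core
# metadata and are not sketched here.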

Last update: 2022-03-08