
How Excel data intake works

The goal of this document is to describe a specific type of file ingestion from complex-formatted Excel files.

Important Note

Note that this option is only available for version 2022.R1 and higher.

Introduction

The process for Excel ingestion follows the same steps and involves the same concepts as outlined in the Asset flow into the platform via landing. Complex Excel file ingestion adds the possibility to ingest tables from data contained in different sections and sheets of an Excel file. What constitutes each of these tables is defined through a set of specific configuration pre-steps, which specify the schema and structure of the Entities that will characterize the metadata for these tables. Some Databricks Notebook artifacts have been created specifically to:

  • Infer the schema of the Excel sheets to generate the schema of each included table. This further informs the way Entities and Views are going to be created, as well as the way the data is going to be extracted.
  • Validate this schema.
  • Generate the related Entities.
  • Ingest the data and create the Databricks tables and views.

The file ingestion process also has a specific implementation, through a separate transfer query (or DSU ingestion) script called ExcelDataIntake.py. This page outlines the complete pre-steps and the data ingestion steps, referencing the common concepts but also adding details on this specific file ingestion.

Step-by-step

How Excel data intake works

For more information, check the specific tutorial for Excel data intake.

End-to-end ingestion process

This page starts from pre-step 3 in Asset flows into the platform via landing. We assume that there is an automated process in place to deposit the Excel files in a scheduled way in the landing folder.

The Excel files need to be deposited in the landing zone container following a set of conventions: a specific file name format, and an agreed folder path that mirrors the structure of Provider/Entities (Asset metadata) in Sidra.

Once the files are in the landing zone, a file ingestion process is triggered every time there is a new drop of data in the landing zone.

The involved pipelines for file ingestion from the landing zone come preinstalled with any Sidra installation. These pipelines are the same as the ones described here.

The process would be the following:

  1. The pipeline RegisterAsset is triggered, which registers the Asset into the platform. This pipeline also copies the file from the landing zone to an Azure Storage container (raw container) specific to the Data Storage Unit (DSU).

  2. The RegisterAsset pipeline calls another pipeline, which is the actual data ingestion pipeline. For file ingestion from the landing zone, the data ingestion pipeline used is the FileIngestionDatabricks pipeline. For the file registration to succeed, the Asset metadata must have been created prior to the first data ingestion, and the created Entities must have been associated with the FileIngestionDatabricks pipeline.

  3. The data ingestion pipeline FileIngestionDatabricks then ingests the data into the Data Storage Unit (DSU) in delta Databricks tables. The main difference in the case of Excel ingestion is that FileIngestionDatabricks will not execute the general transfer query script as described in Transfer query, but a specific transfer query (or DSU ingestion) script called ExcelDataIntake.py. Details about this script are included in Sub-step 2.3. File ingestion. This script reads the input Excel file and the AdditionalProperties assigned to the Entity to which the file belongs, generates an output JSON with the actual data structures found in the Excel, and writes that data into the Databricks tables (an illustrative sketch of this output is shown after this list).

  4. For each sheet, there will be a JSON collection in the output JSON file.

  5. The data ingestion process will create a number of Entities of type View, which represent the different sheets in the data. These views are related to the parent Entity with a ViewOf type of relationship (type = 4). The SQL query to generate these views is included in the Entity field ViewsDefinition. The Excel Data Intake Process incorporates a script called ViewsDefinition.py for creating these Entities in the Core metadata database, as well as the Views in Databricks.

    Info

    You can see a detailed description of this field and the EntityEntity relationships in the Assets metadata.

  6. After the ingestion is finished, the process generates a JSON for each given Entity, which is then ingested in the data lake into the delta table for that Entity.

  7. Once the data ingestion pipeline has finished, the calling pipeline RegisterAsset deletes the copy of the file from the landing zone.
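
As a purely illustrative sketch (the exact format of this intermediate JSON is internal to ExcelDataIntake.py and may differ), the output could contain one collection per configured sheet or identifier, with one object per extracted row:

# Hypothetical shape of the output JSON produced by the ingestion script.
# Names are taken from the configuration examples later in this page;
# the values are made up for illustration only.
output_example = {
    "Dashboard_Cummulative_View": [
        {"Ref": "A-01", "Planned": 120.0, "Actual": 118.5, "Diff": -1.5},
        {"Ref": "A-02", "Planned": 80.0, "Actual": 82.0, "Diff": 2.0}
    ],
    "Dashboard_PlannedCompletePercent_View": [
        {"Dashboard_PlannedCompletePercent": 0.76}
    ]
}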

Step 1. Configure the metadata (Entities, Attributes) for the Data Intake Process

Configuration overview

Note

This step is only required when configuring the Data Intake Process for the first time, or whenever there is any change in the file structures (e.g. added columns).

In the case of complex Excel ingestion, there are additional artifacts created in order to configure the structure of the data inside the Excel documents and their respective Excel sheets, and how these data tables are going to be stored in Databricks. There are several principles and concepts to be considered here:

  1. For each Excel file there will be one main Entity, called parent Entity, which defines the actual schema of that same Excel file. This is the Entity that is processed for file ingestion.

  2. For Assets with different schemas under the same Entity, a specific table AssetAdditionalProperties has been included to cover these scenarios. This table contains a Regex (regular expression) field to map a specific set of Assets, and an AdditionalProperties field for a specific Asset, containing the appropriate JSON configuration for reading the schema of the Excel files to be ingested via the landing area. Then (see the sketch after this list):

    • If there is a match between the Asset name and the regular expression (Regex) field in AssetAdditionalProperties, the AdditionalProperties field from this AssetAdditionalProperties table will be applied as the input for specifying the schema of that Asset.
    • If there is no match between the Asset name and the regular expression (Regex) field in AssetAdditionalProperties, the ExcelDataIntake.py notebook takes the JSON configuration from the AdditionalProperties field at Entity level.
  3. This parent Entity will need to have its AdditionalProperties field populated with a configuration JSON. This configuration JSON describes the schema for each sheet, namely:

    • The transfer query (or DSU ingestion) script to execute to ingest the data from this Excel file.
    • Which data tables or standalone values are contained inside each sheet, and their location inside the sheet. For example, in a given sheet, we may identify a standalone value and a table. There will therefore be two separate identifiers, one for the standalone value, and another for the table.
    • The schema and data types contained in these tables. For example, the schema of a specific identified table in a given sheet (column names and types). The details on the AdditionalProperties syntax and processing rules are included in the section below.
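
The resolution logic between the Asset-level and the Entity-level configuration described in point 2 above can be sketched as follows. This is not the actual platform code; the field names are the ones described above and the function itself is hypothetical:

import re

def resolve_additional_properties(asset_name, asset_additional_properties_rows, entity_additional_properties):
    # asset_additional_properties_rows: rows from the AssetAdditionalProperties table,
    # each with a 'Regex' and an 'AdditionalProperties' field.
    # entity_additional_properties: the AdditionalProperties JSON stored at Entity level.
    for row in asset_additional_properties_rows:
        # If the Asset name matches the regular expression, the Asset-specific
        # configuration is used for this Asset.
        if re.fullmatch(row["Regex"], asset_name):
            return row["AdditionalProperties"]
    # Otherwise the Entity-level AdditionalProperties configuration is used.
    return entity_additional_properties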

Metadata configuration

SchemaInference script

In order to extract the schema for the parent Entity and populate its AdditionalProperties field, a SchemaInference script is used in Databricks. This script contains different methods to either infer the schema of the Excel file or validate a schema given by the user, depending on the user's needs.

For the first case, below is a usage example to infer the schema of a given Excel portion with this script. The file can be located anywhere; in this example it is taken from the Provider's container:

dataFileDir = "https://MYSTORAGEACCOUNT.blob.core.windows.net/MYPROVIDER/MYENTITY/2021/11/19/MYFILE.xlsx"
dataSelection = "'Summary'!A18:E48" #30 rows for test
df = infer_schema(dataFileDir, dataSelection)
display(df)

The output of this will be the following:

  • The actual Excel table read from the specified position in the Excel sheet.
  • A table with the actual schema, e.g.

Excel intake infer schema

The resulting inferred schema can be used as an input to compose the AdditionalProperties JSON field.
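
For example, assuming the inferred schema is available as a simple mapping of column names to type names (the exact output format of the script may differ), an entitySchema element could be composed like this; the sheet, range and view names below are illustrative:

import json

# Column/type pairs as returned by the schema inference step (illustrative).
inferred_schema = {
    "Ref": "double",
    "Code": "string",
    "Name": "string",
    "Accural": "double",
    "Cost": "double"
}

# Wrap the inferred schema into one entitySchema element for the Summary sheet.
entity_schema_fragment = [{
    "identifier": "Summary.Summary_table",
    "datalakedata": {"type": "view", "name": "Summary_View"},
    "type": "range",
    "range": "A19:E48",  # first data row, header excluded
    "schema": inferred_schema
}]

print(json.dumps(entity_schema_fragment, indent=2))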

For the second case, below is the usage example for the method responsible for validating a given schema against an Excel portion:

dataFileDir = "https://MYSTORAGEACCOUNT.blob.core.windows.net/MYPROVIDER/MYENTITY/2021/11/19/MYFILE.xlsx"
dataSelection = "'Summary'!A19:E48" #Note that we start in the first data row now, as opposed to the header row 
schema = get_schema({
                "Ref": "double",
                "Code": "string",
                "Name": "string",
                "Accural": "double",
                "Cost": "double"
            }.items())

df = validate_schema(dataFileDir, dataSelection, schema)
display(df)

The SchemaInference script is able to infer the following data types: "double", "string" and "timestamp". Any other data types will need to be added manually to the schema.
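
As a rough idea of what these helpers do, the sketch below reads an Excel range in Databricks with the com.crealytics spark-excel reader, either inferring the schema or applying an explicit one. This is only an approximation under that assumption; the actual SchemaInference script may be implemented differently:

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# `spark` is the SparkSession available in a Databricks notebook.

def infer_schema_sketch(file_url, data_selection):
    # Let the Excel reader infer the column types for the selected range;
    # data_selection uses the 'Sheet'!A1:E10 address syntax shown above.
    return (spark.read.format("com.crealytics.spark.excel")
            .option("dataAddress", data_selection)
            .option("header", "true")
            .option("inferSchema", "true")
            .load(file_url))

def validate_schema_sketch(file_url, data_selection, schema):
    # Read the same range with an explicit schema instead of inferring it.
    return (spark.read.format("com.crealytics.spark.excel")
            .option("dataAddress", data_selection)
            .option("header", "false")
            .schema(schema)
            .load(file_url))

# Explicit schema equivalent to the get_schema call in the example above.
schema = StructType([
    StructField("Ref", DoubleType()),
    StructField("Code", StringType()),
    StructField("Name", StringType()),
    StructField("Accural", DoubleType()),
    StructField("Cost", DoubleType()),
])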

ExcelIntakeTest script

This script, once executed by the user, is responsible for validating the data extraction configuration (the DATA_EXTRACTION_CONFIG value, corresponding to the entitySchema field of the AdditionalProperties column in the Entity table) against the Excel file. If the execution of the notebook is successful, the next script can be used for the metadata creation.

MetadataCreation script

This script is responsible for:

  1. Creating the required metadata in the Sidra Core database (Attributes and Entities).
  2. Generating the child Entities required to define the view and delta tables from the parent Entity.
  3. Creating the transfer queries and inserting the required delta tables in the DSU.

The MetadataCreation.py script is located in the Databricks resource in the DSU. The user must populate the Set up variables section in order to create all the required metadata. Below is a usage example to set up the variables in this section:

#===============================> Set up variables <===============================

# If Provider already exists
provider_item_id = "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"

# Specify Provider data (name, owner, description and DSU) only when it should be created
# This will be ignored if provider_item_id is specified and the Provider exists
create_provider_data = {
  NAME: "ExcelFilesIngestion",
  OWNER: "johndoe@johndoe.com",
  DESCRIPTION: "Test Provider for Excel files ingestion",
  DSU: 1
}


# Required metadata values
create_entity_data = {
  ENTITY_NAME: "SampleExcel",
  EXCEL_FILE_NAME: "SampleExcelFile",
  FILE_EXTENSION: "xlsx",
  DATA_EXTRACTION_CONFIG: """[
        {
            "identifier": "Dashboard.Dashboard_Cummulative",
            "datalakedata": {
              "type": "view",
              "name": "Dashboard_Cummulative_View"
            },
            "type": "range",
            "range": "S47:V50",
            "schema": {
                "Ref": "string",
                "Planned": "double",
                "Actual": "double",
                "Diff": "double"
            }
        },
        {
            "identifier": "Dashboard.Dashboard_PlannedCompletePercent",
            "datalakedata": {
              "type": "view",
              "name": "Dashboard_PlannedCompletePercent_View"
            },
            "type": "value",
            "cell": "AC39"
        }
    ]"""
}
Where the DATA_EXTRACTION_CONFIG is the value for the entitySchema field of the AdditionalProperties column in the Entity.

Further down in the script, the metadata creation and view definition functions are defined and must be executed (Main cell).

Description of the AdditionalProperties structure

The Excel Data Intake Process makes use of the AdditionalProperties column inside the Entity table to specify how to extract the data from the Excel file. AdditionalProperties is composed of the TransferQuery field and the entitySchema field, which describes the structure of the Excel file schema.

Info

More details about the structure for different Asset schemas in the AdditionalProperties column can be found here.

Below is a sample of a JSON file and a description of each field:

{
    "TransferQuery": "ExcelDataIntake",
    "entitySchema": [
        {
          "identifier": "Prelims.Prelims_table",
          "datalakedata": {
            "type": "view",
            "name": "Prelims_View"
          },
          "type": "range",
          "range": "B17:I170",
          "unpivot": {
              "headerRange": "M16:DK16",
              "headerColumnName": "Dates",
              "dataColumnName": "Totals",
              "extraHeaders": {
                  "15": "DateNumber"
              }
          },
          "schema": {
              "Cost code": "string",
              "Description": "string",
              "Quantity": "double",
              "Unit": "string",
              "Rate": "double",
              "_g17": "string",
              "Option: Add Lump Sum ": "double",
              "Amount": "string"
          }
      },
      {
        "identifiers": [
            "4.01.Subcontractors_Liability_View",
            "4.02.Subcontractors_Liability_View",
            "4.03.Subcontractors_Liability_View"
        ],
        "datalakedata": {
          "type": "view",
          "name": "Subcontractors_Liability_View"
        },
        "identifiersPerSheet": [
          {
            "cell": "B8",
            "columnName": "Subcontractor name"
          },
          {
            "cell": "D13",
            "columnName": "SC Application date"
          }
        ],
        "sections": {
            "id": "Description",
            "possibleValues": {
              "Description": ["Subcontract Order Value (as per Executed Subcontract)", "Non-Recoverable Variations"]
            }
        },
        "type": "range",
        "range": "B11:C32",
        "schema": {
            "Status": "string",
            "Description": "string"
        }
      },
      {
            "identifier": "Dashboard.Dashboard_PlannedCompletePercent",
            "datalakedata": {
              "type": "view",
              "name": "Dashboard_PlannedCompletePercent_View"
            },
            "type": "value",
            "cell": "AC39"
        }
    ]
}
TransferQuery

The TransferQuery field represents the name of the transfer query script to ingest the data, e.g., ExcelDataIntake.

entitySchema

The entitySchema array contains the schema that the parent Entity defines. The schema is defined through the different fields described below.

1. identifier

The identifier field includes the identifier of a data piece found in an Excel sheet. For example:

    ...
    "identifier": "Prelims.Prelims_table"
    ...
where:

  • The first element is the actual name of the Excel sheet, in this example Prelims.
  • The second element is the name that will be given to the generated view in Sidra, in this example Prelims_table.

Alternatively, we can find an element identifiers, which is a list of identifier fields, e.g.:

    ...
    "identifiers": [
        "4.01.Subcontractors_Liability_View",
        "4.02.Subcontractors_Liability_View",
        "4.03.Subcontractors_Liability_View"
    ]
    ...

In this case, we define a set of sheets inside the Excel file that share the same schema (same structure and configuration, starting from the same position) and store their data in the same view/delta table. In the "identifiers" list we add all the sheets whose configuration is the same, and then define that configuration just once (a table configuration in this case). It is also possible to add values from specific cells in each sheet to the final result, via "identifiersPerSheet" as shown in the example below.

    ...
    "identifiers": [
                "4.01.Subcontractors_Liability_View",
                "4.02.Subcontractors_Liability_View",
                "4.03.Subcontractors_Liability_View"
            ],
    "datalakedata": {
    "type": "view",
    "name": "Subcontractors_Liability_View"
    },
    "identifiersPerSheet": [
    {
        "cell": "B8",
        "columnName": "Subcontractor name"
    },
    {
        "cell": "D13",
        "columnName": "SC Application date"
    }
    ]
    ...
If the identifiers array contains only one element, it is considered equivalent to using the identifier field described above.

We can also use regular expressions inside these identifier elements. This may be useful if the Excel file contains an indeterminate number of sheets, all sharing the same structure and the same configuration. The same processing rules defined below the identifiers element will be applied to all the sheets whose names match the regular expression. This will generate N collections in the output JSON file.
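
For instance, a minimal sketch of that matching, assuming the openpyxl library is used to list the sheet names (the actual implementation may differ):

import re
from openpyxl import load_workbook

# Hypothetical: a sheet-name pattern that matches sheets named 4.01, 4.02, 4.03, ...
sheet_name_pattern = r"4\.\d{2}"

workbook = load_workbook("MYFILE.xlsx", read_only=True)  # placeholder path
matching_sheets = [name for name in workbook.sheetnames
                   if re.fullmatch(sheet_name_pattern, name)]

# Each matching sheet is processed with the configuration defined below the
# identifiers element and contributes its own collection to the output JSON.
print(matching_sheets)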

2. type

Three types are possible:

  • value
  • table
  • range

If type = value, the data is contained in a single cell. The field cell will contain the cell location inside the Excel sheet, and the field datatype will contain the type of data inside this cell. For example:

    ...
    {
        "identifier": "Dashboard.Dashboard_PlannedCompletePercent",
        "datalakedata": {
            "type": "view",
            "name": "Dashboard_PlannedCompletePercent_View"
        },
        "type": "value",
        "cell": "AC39"
    }
    ...

If type = table, additional configuration data is needed that describes both the range of cells where the table starts and the schema of the data table:

    {
        "identifier": "Summary.Summary_table",
        "type": "table",
        "range": "A9:P9",
        "includeHeader": "False",
        "maxConsecutiveEmptyRows": 1,
        "schema": {
            "Tab Ref. Cost Code": "string",
            "Name": "string",
            "Trade": "string",
            "Liability": "double",
            "Previous": "double",
            "Difference": "double",
            "Report Cost": "double",
            "Subcon Applied": "double",
            "Subcon Liability": "double",
            "Subcon Accrual": "double",
            "_c10": "string",
            "_c11": "string",
            "_c12": "string",
            "Tender Cost Plan": "double",
            "Check": "double"
        }
    }

where:

  • range: represents the range of cells that contain the first row of the table, e.g., A9:P9. The process will read from this row towards the bottom of the sheet until there is an empty row. A table can therefore have a variable number of rows, and there can be different tables one after another inside the same sheet, as long as they have some empty rows separating them.
  • includeHeader: represents whether the range includes a header or not for that table.
  • maxConsecutiveEmptyRows: represents the maximum number of consecutive empty rows that the range of cells can contain to be considered a table. If the number of empty rows is bigger than this threshold, any rows below will not be considered part of the same table.
  • schema: represents the list of columns contained in this table, as well as their field types.

If type = range, the range of cells constituting the table is specified explicitly (column and row), and is therefore fixed.
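
The row-scanning rule for type = table can be illustrated with the following sketch, where rows is assumed to be the list of rows read downwards from the starting range (an illustration of the rule only, not the actual script):

def collect_table_rows(rows, max_consecutive_empty_rows=1):
    # Read rows from the starting range downwards and stop once more than
    # max_consecutive_empty_rows consecutive empty rows are found.
    table_rows = []
    consecutive_empty = 0
    for row in rows:
        if all(cell is None or str(cell).strip() == "" for cell in row):
            consecutive_empty += 1
            if consecutive_empty > max_consecutive_empty_rows:
                break      # the table has ended
            continue       # tolerated empty row, keep scanning
        consecutive_empty = 0
        table_rows.append(row)
    return table_rows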

3. schema

The schema field includes the list with the names of the columns and their types, in consecutive column order as they appear in the Excel sheet. The Excel data intake supports the following types of data:

  • DoubleType: "double".
  • StringType: "string".
  • DateType: "timestamp".
  • IntegerType: "int".
  • LongType: "bigint".
  • DecimalType: "decimal". For this type, several variants are accepted in order to achieve greater precision in the extraction:
    • "decimal".
    • "decimal(1)".
    • "decimal(1,2)".

An exception will be thrown if a type requested in the schema does not exist.

The names of the columns, e.g., Name, Trade, do not need to match the actual names of the columns in the Excel file; there is an implicit mapping, as they are ordered and consecutive.

In the JSON generated by the transfer query script, a table will be converted to an array. The array will contain complex values with properties, and these properties will be, e.g., Name, Liability, etc., as per the sample above.

In the above example you can see values such as _c10, _c11 or _c12. These are just placeholders to specify that the corresponding column inside the table contains empty values (an empty column in the Excel).
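
As an illustration of how these type names could map onto Spark SQL types (a sketch that assumes the intake runs on PySpark; the real mapping is internal to the intake scripts):

import re
from pyspark.sql.types import (DoubleType, StringType, TimestampType,
                               IntegerType, LongType, DecimalType)

def to_spark_type(type_name):
    # Hypothetical helper mapping the type names allowed in the schema field
    # to their PySpark equivalents; an unknown name raises an exception.
    simple = {
        "double": DoubleType(),
        "string": StringType(),
        "timestamp": TimestampType(),
        "int": IntegerType(),
        "bigint": LongType(),
        "decimal": DecimalType(),
    }
    if type_name in simple:
        return simple[type_name]
    match = re.fullmatch(r"decimal\((\d+)(?:,(\d+))?\)", type_name)
    if match:
        precision = int(match.group(1))
        scale = int(match.group(2)) if match.group(2) else 0
        return DecimalType(precision, scale)
    raise ValueError(f"Unsupported type: {type_name}")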

4. datalakedata

The data extracted from the Excel files can be stored in JSON format in the data lake in order to enable SQL queries (or views) over it, or it can be stored as Delta tables. The datalakedata field includes the type and the name information.

    "identifiers": [
                "4.01.Subcontractors_Liability_View",
                "4.02.Subcontractors_Liability_View",
                "4.03.Subcontractors_Liability_View"
            ],
    "datalakedata": {
    "type": "view",
    "name": "Subcontractors_Liability_View"
    },
    ...
Where:

  • type: indicates how the extracted data will be stored, as a view or a delta table.
  • name: the name of the view or delta table that will contain the data.

Some considerations to note:

  • Several fields and tables can be included inside the same view. All the elements below the first element where datalakedata is populated are considered part of the same view.
  • After an element with datalakedata, all subsequent configuration elements belong to this same view, until a different datalakedata is used (see the sketch below).
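
This grouping rule can be sketched as follows, assuming entity_schema is the parsed entitySchema array (an illustration of the rule only):

def group_by_datalakedata(entity_schema):
    # Assign each entitySchema element to the view (or delta table) declared
    # by the most recent datalakedata entry seen while iterating in order.
    groups = {}
    current = None
    for element in entity_schema:
        if "datalakedata" in element:
            current = element["datalakedata"]["name"]
            groups.setdefault(current, [])
        if current is not None:
            groups[current].append(element)
    return groups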

Step 2. Launch the ingestion process from the Landing zone

To summarize, these are the general sub-steps performed as part of the file ingestion process from the landing zone in Sidra Core:

  • Sub-step 2.1: File registration.
  • Sub-step 2.2: Storage of the raw copy of the file in the DSU.
  • Sub-step 2.3: File ingestion.

Sub-step 2.1 File registration

The file registration is the process of creating an Asset in the platform representing the file to be ingested in the DSU.
The result of the registration is the population of the correct data in the Sidra Core intake metadata and control tables. This process is the same as in Asset flows into the platform via landing.

Sub-step 2.2 Storage of the raw copy of the file in the DSU

This process is the same as in Asset flows into the platform via landing.

Sub-step 2.3 File ingestion

The file ingestion is the process that reads the raw copy of the file and intakes the information in an optimized format in the DSU, after executing some optimizations.

In this case, ExcelDataIntake.py is the script that writes the generated output JSON into the tables in Databricks.
This specific transfer query (or DSU ingestion) script is passed as a parameter to the FileIngestionDatabricks pipeline.

These are the steps performed inside this script:

  • Read the input Excel file.
  • Generate the output JSON containing the processed data.
  • Load the processed data in delta tables in Databricks.
  • Generate the data preview tables in the Sidra Core DB (so that previews are available from the Data Catalog pages).
  • Generate the Entity views.




Last update: 2022-09-08