Skip to content

Scenarios for Excel Data Intake Process with complex files

The goal of this document is to describe different scenarios from complex-formatted Excel files and the configuration for a proper data extraction.

Scenario 1. Support of sections on tables

Whenever a table in an Excel file contains a section (image below), this will be transformed as an additional column with the name of the section as a value.

support-sections-excel

This transformation can be implemented following the next JSON structure for the AdditionalProperties field:

{
    "TransferQuery": "ExcelDataIntake",
    "entitySchema": [
        {
            "identifier": "Accounting Journal.AccountingJournal_table",
            "type": "range",
            "range": "B3:N136",
            "sections": {
                "id": "Account",
                "possibleValues": {
                    "P&L /Balance Sheet": [""],
                    "Trial Balance Debit": [""],
                    "Trial Balance Credit": [""],
                    "Adjusting Entries Debit": [""], 
                    "Adjusting Entries Credit": [""],
                    "Adjusted Trial Balance Debit": [""],
                    "Adjusted Trial Balance Credit": [""],
                    "Income Statement Debit": [""], 
                    "Income Statement Credit": [""],
                    "Balance Sheet Debit": [""],
                    "Balance Sheet Credit": [""]
                },
                "nonPossibleValues": {
                    "Account": "risk"
                }
            },
            "schema":{
                "Code":"double",
                "Account": "string", 
                "P&L /Balance Sheet": "string",
                "Trial Balance Debit": "double",
                "Trial Balance Credit": "double",
                "Adjusting Entries Debit": "double", 
                "Adjusting Entries Credit": "double",
                "Adjusted Trial Balance Debit": "double",
                "Adjusted Trial Balance Credit": "double",
                "Income Statement Debit": "double", 
                "Income Statement Credit": "double",
                "Balance Sheet Debit": "double",
                "Balance Sheet Credit": "double"
            }
        }
    ]
}

Where:

  • id: name of the column with the section identifier.
  • possibleValues: columns and possible values in the header row.
  • nonPossibleValues: columns and non possible values in the header row to exclude.

Also, the sections can be specified in order to have a more controlled result.

{
    "TransferQuery": "ExcelDataIntake",
    "entitySchema": [
        {
            "identifier": "Accounting Journal.AccountingJournal_table",
            "type": "range",
            "range": "B3:N136",
            "sections": {
                "id": "Account",
                "possibleValues": {
                    "Account": [
                        "Sales",
                        "Discounts",
                        "Marketing"
                    ]
                }
            },
            "schema":{
                "Code":"double",
                "Account": "string", 
                "P&L /Balance Sheet": "string",
                "Trial Balance Debit": "double",
                "Trial Balance Credit": "double",
                "Adjusting Entries Debit": "double", 
                "Adjusting Entries Credit": "double",
                "Adjusted Trial Balance Debit": "double",
                "Adjusted Trial Balance Credit": "double",
                "Income Statement Debit": "double", 
                "Income Statement Credit": "double",
                "Balance Sheet Debit": "double",
                "Balance Sheet Credit": "double"
            }
        }
    ]
}

Scenario 2. Support totals on tables

When ingesting the data in Databricks, the "Total" records like in the Excel table shown above will be stored in a different JSON table in Databricks with the corresponding section value. The table will be named, in our example, as AccountingJournal_table_totals. This behaviour will be automatically implemented so there is no need to add it in the AdditionalProperties field. However, it is recommendable to use the following filter avoiding the load of incorrect "Total" registers (skipTotals).

support-sections-excel

{
    "TransferQuery": "ExcelDataIntake",
    "entitySchema": [
        {
            "identifier": "Accounting Journal.AccountingJournal_table",
            "type": "range",
            "range": "B3:N136",
            "sections": {
                "id": "Account",
                "possibleValues": {
                    "Account": [
                        "Sales",
                        "Discounts",
                        "Marketing"
                    ]
                },
                "nonPossibleValues": {
                    "Account": "risk"
                }
            },
            "skipTotals": {
            "Description": ["Total Packaging"]
            },
            "schema":{
                "Code":"double",
                "Account": "string", 
                "P&L /Balance Sheet": "string",
                "Trial Balance Debit": "double",
                "Trial Balance Credit": "double",
                "Adjusting Entries Debit": "double", 
                "Adjusting Entries Credit": "double",
                "Adjusted Trial Balance Debit": "double",
                "Adjusted Trial Balance Credit": "double",
                "Income Statement Debit": "double", 
                "Income Statement Credit": "double",
                "Balance Sheet Debit": "double",
                "Balance Sheet Credit": "double"
            }
        }
    ]
}

Scenario 3. Mapping same schema and different column names

As described above, when having two or more sheets in an Excel file that contain the exact same structure starting from the very same position, they can be specified in the identifiers field and share the schema declared in the JSON of the AdditionalProperties field of Entities. In the case when there are more sheets with an identical schema but where just the name of some columns is changed, we can specify it in the JSON as follows. This way, there would not be need to generate different items in metadata and the data would be in the same tables (identifiersColumnMapping).

{
    "TransferQuery": "ExcelDataIntake",
    "entitySchema": [
        {
            "identifier": [ 
                "Accounting Journal.AccountingJournal_table",
                "Accounting Journal Section A.AccountingJournalSectionA_table"
            ],
            "identifiersColumnMapping":
            {
              "AccountingJournalSectionA_table": {
                "Account": "Account Section A",
                "P&L /Balance Sheet": "Balance Sheet"
              }
            },
            "type": "range",
            "range": "B3:N136",
            "sections": {
                "id": "Account",
                "possibleValues": {
                    "Account": [
                        "Sales",
                        "Discounts",
                        "Marketing"
                    ]
                },
                "nonPossibleValues": {
                    "Account": "risk"
                }
            },
            "skipTotals": {
            "Description": ["Total Packaging"]
            },
            "schema":{
                "Code":"double",
                "Account": "string", 
                "P&L /Balance Sheet": "string",
                "Trial Balance Debit": "double",
                "Trial Balance Credit": "double",
                "Adjusting Entries Debit": "double", 
                "Adjusting Entries Credit": "double",
                "Adjusted Trial Balance Debit": "double",
                "Adjusted Trial Balance Credit": "double",
                "Income Statement Debit": "double", 
                "Income Statement Credit": "double",
                "Balance Sheet Debit": "double",
                "Balance Sheet Credit": "double"
            }
        }
    ]
}

Scenario 4. Adding data from headers

When an Excel table has several headers as the image shown below, the rows related to the headers will pivot converting in columns so the data can be stored together. Thus, it is not necessary to declare a rigid schema regarding a variable number of columns as, in this example, the number of months or weeks.

support-sections-excel

Our example for the JSON would be the next one:

{
    "TransferQuery": "ExcelDataIntake",
    "entitySchema": [
        {
            "identifier": "Accounting Journal.AccountingJournalYear_table",
            "type": "range",
            "range": "B4:J25",
            "unpivot": {
                "headerRange": "D3:J3",
                "headerColumnName": "Dates",
                "dataColumnName": "Totals",
                "extraHeaders": {
                    "2":"MonthNumber"
                }
            }
            "sections": {
                "id": "Account",
                "possibleValues": {
                    "Account": [
                        "Sales",
                        "Discounts",
                        "Marketing"
                    ]
                },
                "nonPossibleValues": {
                    "Account": "risk"
                }
            },
            "skipTotals": {
            "Description": ["Total Packaging"]
            },
            "schema":{
                "Code":"double",
                "Account": "string"
            }
        }
    ]
}

Where:

  • headerRange: is the row range with all columns to unpivot (data range will be automatically detected).
  • headerColumnName: name of the column that will contain all unpivoted columns (default value: UnpivotHeaderColumn).
  • dataColumnName: name of the column that will contain the data unpivoted (default value: Total).
  • extraHeaders: "row number": "unpivoted column name" will indicate extra rows to be unpivoted apart from the one specified in the headerRange, the name of the unpivoted column should be also specified (range will be the same than headerRange but changing the row number).

Scenario 5. Different schemas per Asset in same Entity

In the case that the user needs to ingest some file (Asset) in the same Entity (already created) and the Asset's schema differs from the one described in the Entity, we can identify specific AdditionalProperties in the table [DataIngestion].[AssetAdditionalProperties] with a concrete RegularExpression which identify them.

More information about this can be found here.

The configuration of this AdditionalProperties per Asset can be done through SQL script in the database or by the recommendable way, using the endpoint from Sidra API: /api/metadata/Assets/additionalProperty. Below there is an example code of the request body to fill. Take into account that depending on if inserting, the id will be non-required, and, if updating, the id will be required:

{
  "id": 0,
  "idEntity": 0,
  "assetNameRegex": "string",
  "additionalProperties": "string",
  "order": 0
}


Sidra Ideas Portal


Last update: 2022-07-25
Back to top