Skip to content

How Excel data intake works

In this tutorial we detail how to configure a Sidra Provider, Entity and the Attributes to load Excel files in the platform through an example.

Introduction

The process of Excel data intake can be carried out by using Sidra API or an SQL script directly on the Sidra Core metadata database. A sample SQL script is provided to apply such configuration.

As opposed to database Providers, as there is not an automatic metadata inference pipeline yet available, we need to manually specify each of the Entities that are needed to be loaded, as well as their Attributes.

Info

The Data Intake Process for file ingestion from the landing zone is described conceptually in the section Data intake via landing zone.

Configuration for data intake from Excel

There are several ways to configure this kind of intake of data:

  • Configuration through the API.
  • Configuration through SQL script (recommendable).

1. Configuration through the API

The following calls to the API can be performed through Swagger or PowerShell:

  • Endpoint /api/metadata/providers: used to create the Provider.
  • Endpoint /api/metadata/entities: used to create the Entities.
  • Endpoint /api/metadata/attributes: used to create the Attributes associated to an Entity.
  • Endpoint /api/metadata/entities/{idEntity}/pipelines/{idPipeline}: used to associate an Entity with a pipeline.

Info

For more information, you can see the following detailed pages:

You can find more information about the Sidra metadata system in Assets metadata.

2. Configuration through SQL script (example)

The SQL script detailed below presents a set of variables at the start that are needed to create the Sidra metadata objects. By default, the Entities will be associated with the pipeline FileIngestionDatabricks.

Next, we will detail the variables that are needed to be specified following a given example. It would correspond to an Excel file with the below structure:

table of example

Step 1. Provider configuration and creation

We will need to configure and create a Provider in the SQL server where we have the Core database:

-------------------------------------------------------------------
-- PROVIDER CONFIGURATION
-------------------------------------------------------------------

DECLARE @ProviderItemId UNIQUEIDENTIFIER = NEWID()
DECLARE @ProviderName VARCHAR(MAX)= 'MyFirstExcelIngestion'
DECLARE @ProviderDatabaseName VARCHAR(MAX)= 'myfirstexcelingestion' -- lower case
DECLARE @ProviderDescription VARCHAR(MAX) = 'Attempt to ingest specific data from the AccountingJournal Excel file sample'
DECLARE @ProviderOwner NVARCHAR(MAX) = N'John.Doe@PlainConcepts.com'
DECLARE @IdDSU INT = 1

-------------------------------------------------------------------
-- PROVIDER CREATION
-------------------------------------------------------------------

    INSERT INTO [DataIngestion].[Provider]
        ([ItemId], [ProviderName], [DatabaseName], [Description], [Owner], [IdDataStorageUnit], [ParentSecurityPath], 
        [CreationDate])
    VALUES
        (@ProviderItemId, @ProviderName, @ProviderDatabaseName, @ProviderDescription, @ProviderOwner, @IdDSU, '1', 
        GETUTCDATE())
-------------------------------------------------------------------
-- PROVIDER INFORMATION CHECKING
-------------------------------------------------------------------

SELECT * FROM DataIngestion.Provider ORDER BY Id desc

Step 2. Parent Entity metadata configuration and creation

Step 2.1. Entity configuration and creation

In the case of the Entities associated to this Provider, we will need to specify some additional data.

For example, we need to specify the file format (in this case it will be .xlsx), the target DSU (Data Storage Unit) and the regular expression, used to link the files deposited in the landing zone with the corresponding Entity.

The regular expression described in the example below identifies a file with the following name. You may download the file to view the data:

AccountingJournal_2022-03-01_12-00-00.xlsx

This expression needs always to match the name of the document, so that Sidra can identify the latest Assets for that Entity.

The sample SQL script would look similar to this one:

-------------------------------------------------------------------
-- ENTITY CONFIGURATION
-------------------------------------------------------------------

DECLARE @EntityName VARCHAR(MAX) = 'AccountingJournalEntity'
DECLARE @FileFormat VARCHAR(MAX) = 'xlsx' -- Options: file formats of Excel
DECLARE @RegularExpression NVARCHAR(MAX) = '^AccountingJournal_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2})).xlsx' 
-- Example: '^<<REPLACE_THIS_BY_FILE_NAME>>_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2})).<<REPLACE_THIS_BY_FILE_FORMAT>>' 'AccountingJournal_2022-03-01_12-00-00.xlsx'

DECLARE @DataLakeName VARCHAR(MAX) = 'pcsicoredevstadlakene' -- Name for DSU storage account

DECLARE @EntityProperties VARCHAR(MAX) = N'' -- The configuration of this JSON will be updated later

DECLARE @ProviderId INT = (SELECT TOP 1 Id FROM [DataIngestion].[Provider] WHERE ItemId = @ProviderItemId) -- Created before, example @ProviderItemId = 657498AF-4732-4969-9FF4-78F07E1A127A

-------------------------------------------------------------------
-- ENTITY CREATION
-------------------------------------------------------------------

    INSERT INTO [DataIngestion].[Entity]
        ([IdProvider], [Name], [TableName], [RegularExpression],
        [StartValidDate], [EndValidDate], [Serde], [SerdeProperties], [Encoding], [HeaderLines], [FieldDelimiter],
        [LastUpdated], [LastDeployed], [Format], [NullText], [ReCreateTableOnDeployment], [RowDelimiter], [FilesPerDrop],
        [SourcePath], [Description], [AdditionalProperties], [IdTableFormat], [GenerateDeltaTable], [ParentSecurityPath], 
        [CreationDate], [ItemId]) 
    VALUES 
        (@ProviderId, @EntityName, @EntityName, @RegularExpression,
        GETUTCDATE() -1, NULL, NULL, NULL, NULL, 0, NULL, -- Note that the StartValidDate value must be 24h hours prior to the current time to trigger the RegisterAsset pipeline
        GETUTCDATE(), NULL, @FileFormat, NULL, 1, NULL, 1,
        NULL, NULL, @EntityProperties, 3, 0, CONCAT('1/', @ProviderId), 
        GETUTCDATE(), NEWID())
-------------------------------------------------------------------
-- ENTITY INFORMATION CHECKING
-------------------------------------------------------------------

SELECT * FROM DataIngestion.Entity ORDER BY Id desc
Step 2.2. Attributes configuration

Each of the columns of this Excel file will be mapped as an Attribute in Sidra after the number of configurations done below. This is unlike the CSV file ingestion, where the Attributes are directly defined in this step. For Excel file ingestion we indicate each one of the fields (or columns) contained in the Excel file, as well as the type of this field to be used in Databricks and SQL Server. For more information, please check Data intake from CSV files.

The sample SQL script would look similar to this one:

DECLARE @EntityId INT = 2315
--- Created before, example: @EntityItemId = 2315

-------------------------------------------------------------------
-- ATTRIBUTES CREATION
-------------------------------------------------------------------

INSERT INTO [DataIngestion].[Attribute] 
    ([IdEntity],[Name],HiveType,MaxLen,IsNullable,NeedTrim,RemoveQuotes,ReplacedText,ReplacementText,
    SpecialFormat,TreatEmptyAsNull,IsPrimaryKey,[Order],IsCalculated,IsPartitionColumn,IsMetadata,SQLType,
    ValidationText,[Description],SourceName,ParentSecurityPath,IsEncrypted,DataMask,ItemId)
VALUES
    ('2315', 'IdEntity',           'INT',        NULL, 0,     0,        0,     NULL, NULL, NULL,                             0,        1,     1,     0,     0,     0,     'INT',                     NULL, NULL, 'IdEntity',          CONCAT('1/','1087','/','2315'), 0,     NULL,        NEWID())
    ,('2315', 'Location',           'STRING',     NULL, 0,     0,        0,     NULL, NULL, NULL,                             0,        0,     2,     0,     0,     0,     'nvarchar(256)',              NULL, NULL, 'Location',           CONCAT('1/','1087','/','2315'), 0,     NULL,        NEWID())
    ,('2315', 'JsonDataDir',        'STRING',     NULL, 0,     0,        0,     NULL, NULL, NULL,                             0,        0,     3,     0,     0,     0,     'nvarchar(256)',              NULL, NULL, 'JsonDataDir',       CONCAT('1/','1087','/','2315'), 0,     NULL,        NEWID())
    ,('2315', 'LoadDate',           'STRING',     NULL, 0,     0,        0,     NULL, NULL, 'FROM_UNIXTIME(UNIX_TIMESTAMP())',        0,     0,     4,     1,     0,     1,     'datetime2(7)',            NULL, NULL, NULL,                CONCAT('1/','1087','/','2315'), 0,     NULL,        NEWID())
    ,('2315', 'HasErrors',          'BOOLEAN',    NULL, 0,     0,        0,     NULL, NULL, 'FALSE',                                  0,     0,     5,     1,     0,     1,     'bit',                     NULL, NULL, NULL,                CONCAT('1/','1087','/','2315'), 0,     NULL,        NEWID())
    ,('2315', 'SidraIsDeleted',     'BOOLEAN',    NULL, 0,     0,        0,     NULL, NULL, 'SidraIsDeleted',                         0,     0,     6,     1,     0,     1,     'bit',                     NULL, NULL, NULL,                CONCAT('1/','1087','/','2315'), 0,     NULL,        NEWID())
    ,('2315', 'IdSourceItem',       'INT',        NULL, 0,     0,        0,     NULL, NULL, 'IdSourceItem',                   0,        0,     7,     1,     1,     1,     'int',                     NULL, NULL, NULL,                CONCAT('1/','1087','/','2315'), 0,     NULL,        NEWID())

    --- where ('1/','1087','/','2315') = ('DSU/','@ProviderId','/','@EntityId') 
Step 2.3. Pipeline configuration and relation EntityPipeline

We would need to configure the pipeline's template that is going to be used. This template (@PipelineItemId) will always have the same ItemId.

To relate the parent Entity with the data ingestion pipeline responsible for ingesting the data (FileIngestionDatabricks), next script will be necessary:

-------------------------------------------------------------------
-- ENTITYPIPELINE RELATIONSHIP
-------------------------------------------------------------------

DECLARE @EntityId INT = 2315

DECLARE @PipelineItemId UNIQUEIDENTIFIER = 'F8CD4AD9-DCAE-4463-9185-58292D14BE99' --always the same template ItemId

DECLARE @PipelineId INT = (SELECT TOP (1) p.[Id] FROM [DataIngestion].[Pipeline] p WHERE p.[ItemId] = @PipelineItemId)

INSERT INTO [DataIngestion].[EntityPipeline] ([IdEntity], [IdPipeline]) VALUES (@EntityId, @PipelineId)
Step 2.4. File registration in Sidra

Now, we will register the Excel file in the stage storage account for the environment of our example, where the landing zone is located:

pcsicoredevstastagene

Note

Note that the names of the storage accounts used will contain the words stage and dlake to clarify the purpose of each one of them.

Next steps will be required in order to do the ingestion properly:

1. Rename the Excel file

First, we will check the RegularExpression created:

SELECT * FROM DataIngestion.Entity ORDER BY Id desc
And we will apply the naming convention described before in the SQL script, changing manually the name of the Excel file as it follows before uploading it:

AccountingJournal_2022-03-01_12-00-00.xlsx

After this, the file will be ready to go on with the process of ingestion.

2. Deposit the file in the landing zone

We will go to the landing blob container in the stage storage account commented before (pcsicoredevstastagene). There, we will follow the next convention:

/<name_provider>/<name_entity>/<year>/<month>/<day>/<file>.xlsx

In our example, we will create a folder with the name of the Provider (MyFirstExcelIngestion). Inside the folder, we will create a folder with the name of the Entity (AccountingJournalEntity). There, we will create three consecutive folders, each one inside the last one created, with the actual date, starting per year (2022) and following per month (02) and day (21). In the 'day' folder, we will store the renamed Excel file AccountingJournal_2022-03-01_12-00-00.xlsx.

Uploading the file in the stage storage account as done above, we should be able to see in Data Factory a new pipeline called RegisterAsset, triggered by the RegularExpression, executed and responsible for creating the Asset into the storage account specific of the DSU. This means that there will be an automatic movement of the folders created (including the Excel file inside of them) from the landing zone to a new Azure Storage container called with the same name of the Provider (in our example, myfirstexcelingestion) in the same stage storage account (pcsicoredevstastagene), launched by the RegisterAsset pipeline. Note that the name of the Excel file will be changed, adding an Id number at the end of the name.

Folder movement

Once the files are in the landing zone, a file ingestion process is triggered every time there is a new drop of data in the landing zone.

Step 2.5. Infer the schema

Sidra supports ingestion of complex data in Excel files, and for this, it provides several helper tools to ease the process of configuring the metadata, like what tables are present and where are located in the Excel file. One of them is the SchemaInference script.

Then, next step will be to use this script in Databricks workspace to extract the schema of the Excel data file as a guide and later check it manually or with the Validate schema method inside the same script.

dataFileDir = "https://pcsicoredevstastagene.blob.core.windows.net/myfirstexcelingestion/AccountingJournalEntity/2022/03/01/accountingjournal_2022-03-01_12-00-00_id3951.xlsx"
dataSelection = "'Accounting Journal'!B4:N134" 
df = infer_schema(dataFileDir, dataSelection)
display(df)
The steps would be:

  1. Copy the URL of the Excel file accountingjournal_2022-03-01_12-00-00_id3951.xlsx inside the new container myfirstexcelingestion from the stage storage account and paste it as a path in the Databricks Notebook method (dataFileDir).

  2. Select the cells of the Excel with the wanted information to ingest in the field dataSelection.

  3. Execute the first cell of the Python notebook in Databricks in order to initialize the methods definitions and run the SchemaInference method with tha values dataFileDirand dataSelection changed as in the example above.

  4. Then, we should obtain the display of a dataframe generated from the Excel file data and the dataframe schema itself infered by the method applied.

SchemaInference

Note

This infered schema must always be checked manually by the user, given that the library could infer types that are not desired, for example inferring numeric values as string.

Step 2.6. JSON configuration

The provided schema must be formatted to JSON with the following syntax and must populate the AdditionalProperties field of the created Entity.

-------------------------------------------------------------------
-- JSON CONFIGURATION
-------------------------------------------------------------------

UPDATE DataIngestion.Entity set AdditionalProperties = N'{
"TransferQuery": "Excel/ExcelDataIntake",
"entitySchema": 
    [
        {
            "identifier":"Accounting Journal.AccountingJournal_table",
            "type": "table",
            "range": "B4:N4",
            "schema": {     
                "Code":"double",
                "_c1":"string",
                "Balance Sheet":"string",
                "Debit3":"double",
                "Credit4":"double",
                "Debit5":"double",
                "Credit6":"double",
                "Debit7":"double",
                "Credit8":"double",
                "Debit9":"double",
                "Credit10":"double",
                "Debit11":"double",
                "Credit12":"double"
            }
        }
    ]
}' where Id = 2315 --example

-------------------------------------------------------------------
-- ENTITY INFORMATION CHECKING
-------------------------------------------------------------------

SELECT * FROM DataIngestion.Entity ORDER BY Id DESC

Where:

  • Each key-value defining the schema will represent a section of the data within the sheet of the Excel file.
  • The identifier will be the name Accounting Journal.AccountingJournal_table, with AccountingJournal_table specified later in the JSON with the data from the Excel file.

Step 3. Child Entities configuration

The ViewsDefinition script, once executed manually by the user, is responsible for inserting all the views that result from the configuration of a parent Entity. This means, adding entries for the new Entities in the Sidra Core metadata database, as well as adding the Attributes for such Entities (also in the Sidra Core metadata database).

We will use the ViewsDefinition script configuring it with the additionalProperties values and the following:

parentEntityName = 'AccountingJournalEntity'
parentEntityTable = 'AccountingJournalEntity'
parentEntityItemId = '34AE0551-A452-43C9-B040-E7356F57F203' # ItemId on Entity table
providerItemId = '657498AF-4732-4969-9FF4-78F07E1A127A'
providerDatabaseName = 'myfirstexcelingestion'
storageName = 'pcsicoredevstadlakene'  #DSU

This script will give as a result an SQL script representing the ViewsDefinition of the data and the child Entity related to it. After executing the SQL script, the view structure of the Excel file will be defined, which will be passed as metadata Attributes in order to visualize the data as table in Databricks.

Step 4. Ingestion of data in Databricks

  1. Execute the CreateTable pipeline in Data Factory.
  2. In Databricks, the table with the data from the Excel file will have been created. To make sure, check the data section of Databricks in the database defined before (in our case myfirstexcelingestion).
  3. Now, execute the FileIngestionDatabricks pipeline in Data Factory. This pipeline has the following parameters:

    • assetDate: is the execution's date. Example: 3/1/2022 12:00:00 AM
    • assetReferences: is the path with the assetId that we ingested previously in the container of the storage account. The information in this path will be splitted by the ExcelDataIntake.py transfer query to generate all the information. Example: 3951,wasbs://myfirstexcelingestion@pcsicoredevstastagene.blob.core.windows.net/AccountingJournalEntity/2022/03/ 01/accountingjournal_2022-03-01_12-00-00_id3951.xlsx.
    • input: the same path without the Id and the final name. Example: wasbs://myfirstexcelingestion@pcsicoredevstastagene.blob.core.windows.net/AccountingJournalEntity/2022/03/01 .
    • assetId. Example: 3951.
    • tableName: the name of the Entity. Example: AccountingJournalEntity.
    • providerName. Example: MyFirstExcelIngestion.
    • providerId. Example: 1087.

With the execution of this pipeline:

  1. The ExcelDataIntake.py transfer query will read, thanks to Spark libraries, all the content of the table and will create, from this metadata, the JSON file with all the information from the file.
  2. The JSON file will be uploaded into the DSU storage account, in the container with the name of the Provider.

    Data lake uploading

  3. Afterwards, the ExcelDataIntake.py transfer query will create a view inside the database which will have the view definition configured for the child Entities.

  4. There will be now a delta table inside Databricks with the name MyFirstExcelIngestion.AccountingJournalEntity_accountingjournal_table with all the data coming from the JSON file.

Last update: 2022-03-18
Back to top