How Excel data intake works¶
Information
Take into account that this tutorial is an illustrative guide for Sidra version 1.11 (2022.R1). Since the platform is under continuous development, there may be minor changes to the displayed code. For more information, you can contact Sidra's Support Team.
In this tutorial we detail how to configure a Sidra Provider, Entity and the Attributes to load Excel files in the platform through an example.
Introduction¶
The process of Excel data intake can be carried out by using the Sidra API or an SQL script executed directly on the Sidra Core metadata database. A sample SQL script is provided to apply such configuration.
As opposed to database Providers, there is no automatic metadata inference pipeline available yet, so we need to manually specify each of the Entities that need to be loaded, as well as their Attributes.
The Data Intake Process for file ingestion from the landing zone is described conceptually in the section Data intake via landing zone.
Configuration for data intake from Excel¶
There are several ways to configure this kind of data intake:
- Configuration through the API.
- Configuration through an SQL script (recommended).
1. Configuration through the API¶
The following calls to the API can be performed through Swagger or PowerShell:
- Endpoint /api/metadata/providers: used to create the Provider.
- Endpoint /api/metadata/entities: used to create the Entities.
- Endpoint /api/metadata/attributes: used to create the Attributes associated to an Entity.
- Endpoint /api/metadata/entities/{idEntity}/pipelines/{idPipeline}: used to associate an Entity with a pipeline.
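For illustration purposes only, the sketch below shows how these calls could be made from Python with the requests library. The base URL, the bearer token and the request body fields are assumptions (the fields simply mirror the metadata columns used in the SQL script later in this tutorial), so they may differ from the exact contract of your Sidra version.
import requests

SIDRA_API = "https://<your-sidra-core-api>"           # placeholder base URL
HEADERS = {"Authorization": "Bearer <access-token>"}   # placeholder token

# Create the Provider (body fields assumed from the metadata model used in the SQL example below)
provider = requests.post(f"{SIDRA_API}/api/metadata/providers", headers=HEADERS, json={
    "providerName": "MyFirstExcelIngestion",
    "databaseName": "myfirstexcelingestion",
    "description": "Excel ingestion sample",
    "owner": "<owner-email>",
    "idDataStorageUnit": 1,
}).json()

# Create the Entity under that Provider (fields assumed, see Step 2.1 for the full metadata)
entity = requests.post(f"{SIDRA_API}/api/metadata/entities", headers=HEADERS, json={
    "idProvider": provider["id"],
    "name": "AccountingJournalEntity",
    "format": "xlsx",
}).json()

# Attributes would be created through /api/metadata/attributes in the same way.

# Associate the Entity with the ingestion pipeline (HTTP verb assumed)
requests.post(f"{SIDRA_API}/api/metadata/entities/{entity['id']}/pipelines/<idPipeline>", headers=HEADERS)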
Step-by-step
Add new Provider
This section explains how to add a new Provider.
Add new Entity
This section explains how to add a new Entity.
Add new Attribute
This section explains how to add a new Attribute.
Associate Entity to pipeline
This section explains how to associate Entities to pipelines.
2. Configuration through SQL script (example)¶
The SQL script detailed below presents a set of variables at the start that are needed to create the Sidra metadata objects. By default, the Entities will be associated with the pipeline FileIngestionDatabricks.
Next, we will detail the variables that need to be specified, following a given example. It corresponds to an Excel file with the below structure:
Step 1. Provider configuration and creation¶
We will need to configure and create a Provider in the SQL Server instance where the Core database is hosted:
-------------------------------------------------------------------
-- PROVIDER CONFIGURATION
-------------------------------------------------------------------
DECLARE @ProviderItemId UNIQUEIDENTIFIER = NEWID()
DECLARE @ProviderName VARCHAR(MAX)= 'MyFirstExcelIngestion'
DECLARE @ProviderDatabaseName VARCHAR(MAX)= 'myfirstexcelingestion' -- lower case
DECLARE @ProviderDescription VARCHAR(MAX) = 'Attempt to ingest specific data from the AccountingJournal Excel file sample'
DECLARE @ProviderOwner NVARCHAR(MAX) = N'[email protected]'
DECLARE @IdDSU INT = 1
-------------------------------------------------------------------
-- PROVIDER CREATION
-------------------------------------------------------------------
INSERT INTO [DataIngestion].[Provider]
([ItemId], [ProviderName], [DatabaseName], [Description], [Owner], [IdDataStorageUnit], [ParentSecurityPath],
[CreationDate])
VALUES
(@ProviderItemId, @ProviderName, @ProviderDatabaseName, @ProviderDescription, @ProviderOwner, @IdDSU, '1',
GETUTCDATE())
-------------------------------------------------------------------
-- PROVIDER INFORMATION CHECKING
-------------------------------------------------------------------
SELECT * FROM DataIngestion.Provider ORDER BY Id desc
Step 2. Parent Entity metadata configuration and creation¶
Step 2.1. Entity configuration and creation¶
In the case of the Entities associated to this Provider, we will need to specify some additional data.
For example, we need to specify the file format (in this case it will be .xlsx), the target DSU (Data Storage Unit) and the regular expression used to link the files deposited in the landing zone with the corresponding Entity.
The regular expression described in the example below identifies a file with the name AccountingJournal_2022-03-01_12-00-00.xlsx.
This expression must always match the name of the file, so that Sidra can identify the latest Assets for that Entity.
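As a quick sanity check before inserting the metadata, the expression can be tested locally. The snippet below is a small, assumed helper in Python; note that Python spells named groups as (?P<name>...), whereas the expression stored in the Sidra metadata uses the (?<name>...) syntax.
import re

# Same expression as in the script below, rewritten with Python-style named groups for local testing
pattern = (r"^AccountingJournal_((?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2}))"
           r"_((?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2}))\.xlsx")

match = re.match(pattern, "AccountingJournal_2022-03-01_12-00-00.xlsx")
print(bool(match))            # True: the file name will be linked to the Entity
print(match.group("year"))    # 2022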
The sample SQL script would look similar to this one:
-------------------------------------------------------------------
-- ENTITY CONFIGURATION
-------------------------------------------------------------------
DECLARE @EntityName VARCHAR(MAX) = 'AccountingJournalEntity'
DECLARE @FileFormat VARCHAR(MAX) = 'xlsx' -- Options: file formats of Excel
DECLARE @RegularExpression NVARCHAR(MAX) = '^AccountingJournal_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2})).xlsx'
-- Example: '^<<REPLACE_THIS_BY_FILE_NAME>>_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2})).<<REPLACE_THIS_BY_FILE_FORMAT>>' 'AccountingJournal_2022-03-01_12-00-00.xlsx'
DECLARE @DataLakeName VARCHAR(MAX) = 'sdscoredevstadlakene' -- Name for DSU storage account
DECLARE @EntityProperties VARCHAR(MAX) = N'' -- The configuration of this JSON will be updated later
DECLARE @ProviderId INT = (SELECT TOP 1 Id FROM [DataIngestion].[Provider] WHERE ItemId = @ProviderItemId) -- Created before, example @ProviderItemId = 657498AF-4732-4969-9FF4-78F07E1A127A
-------------------------------------------------------------------
-- ENTITY CREATION
-------------------------------------------------------------------
INSERT INTO [DataIngestion].[Entity]
([IdProvider], [Name], [TableName], [RegularExpression],
[StartValidDate], [EndValidDate], [Serde], [SerdeProperties], [Encoding], [HeaderLines], [FieldDelimiter],
[LastUpdated], [LastDeployed], [Format], [NullText], [ReCreateTableOnDeployment], [RowDelimiter], [FilesPerDrop],
[SourcePath], [Description], [AdditionalProperties], [IdTableFormat], [GenerateDeltaTable], [ParentSecurityPath],
[CreationDate], [ItemId])
VALUES
(@ProviderId, @EntityName, @EntityName, @RegularExpression,
GETUTCDATE() -1, NULL, NULL, NULL, NULL, 0, NULL, -- Note that the StartValidDate value must be 24 hours prior to the current time to trigger the RegisterAsset pipeline
GETUTCDATE(), NULL, @FileFormat, NULL, 1, NULL, 1,
NULL, NULL, @EntityProperties, 3, 0, CONCAT('1/', @ProviderId),
GETUTCDATE(), NEWID())
-------------------------------------------------------------------
-- ENTITY INFORMATION CHECKING
-------------------------------------------------------------------
SELECT * FROM DataIngestion.Entity ORDER BY Id desc
Step 2.2. Attributes configuration¶
Each of the columns of this Excel file will be mapped as an Attribute in Sidra after the set of configurations described below. This is unlike CSV file ingestion, where the Attributes are directly defined in this step. For Excel file ingestion, we indicate each one of the fields (or columns) contained in the Excel file, as well as the type of each field to be used in Databricks and SQL Server, in the later steps of this configuration. For more information, please check Data intake from CSV files.
The sample SQL script would look similar to this one:
DECLARE @EntityId INT = 2315 -- Id of the Entity created before, example: @EntityId = 2315
-------------------------------------------------------------------
-- ATTRIBUTES CREATION
-------------------------------------------------------------------
INSERT INTO [DataIngestion].[Attribute]
([IdEntity],[Name],HiveType,MaxLen,IsNullable,NeedTrim,RemoveQuotes,ReplacedText,ReplacementText,
SpecialFormat,TreatEmptyAsNull,IsPrimaryKey,[Order],IsCalculated,IsPartitionColumn,IsMetadata,SQLType,
ValidationText,[Description],SourceName,ParentSecurityPath,IsEncrypted,DataMask,ItemId)
VALUES
('2315', 'IdEntity', 'INT', NULL, 0, 0, 0, NULL, NULL, NULL, 0, 1, 1, 0, 0, 0, 'INT', NULL, NULL, 'IdEntity', CONCAT('1/','1087','/','2315'), 0, NULL, NEWID())
,('2315', 'Location', 'STRING', NULL, 0, 0, 0, NULL, NULL, NULL, 0, 0, 2, 0, 0, 0, 'nvarchar(256)', NULL, NULL, 'Location', CONCAT('1/','1087','/','2315'), 0, NULL, NEWID())
,('2315', 'JsonDataDir', 'STRING', NULL, 0, 0, 0, NULL, NULL, NULL, 0, 0, 3, 0, 0, 0, 'nvarchar(256)', NULL, NULL, 'JsonDataDir', CONCAT('1/','1087','/','2315'), 0, NULL, NEWID())
,('2315', 'LoadDate', 'STRING', NULL, 0, 0, 0, NULL, NULL, 'FROM_UNIXTIME(UNIX_TIMESTAMP())', 0, 0, 4, 1, 0, 1, 'datetime2(7)', NULL, NULL, NULL, CONCAT('1/','1087','/','2315'), 0, NULL, NEWID())
,('2315', 'HasErrors', 'BOOLEAN', NULL, 0, 0, 0, NULL, NULL, 'FALSE', 0, 0, 5, 1, 0, 1, 'bit', NULL, NULL, NULL, CONCAT('1/','1087','/','2315'), 0, NULL, NEWID())
,('2315', 'SidraIsDeleted', 'BOOLEAN', NULL, 0, 0, 0, NULL, NULL, 'SidraIsDeleted', 0, 0, 6, 1, 0, 1, 'bit', NULL, NULL, NULL, CONCAT('1/','1087','/','2315'), 0, NULL, NEWID())
,('2315', 'IdSourceItem', 'INT', NULL, 0, 0, 0, NULL, NULL, 'IdSourceItem', 0, 0, 7, 1, 1, 1, 'int', NULL, NULL, NULL, CONCAT('1/','1087','/','2315'), 0, NULL, NEWID())
-- where CONCAT('1/','1087','/','2315') corresponds to CONCAT('<IdDSU>/', '<ProviderId>', '/', '<EntityId>')
Step 2.3. Pipeline configuration and relation EntityPipeline¶
We will need to configure the pipeline template that is going to be used. This template (@PipelineItemId) will always have the same ItemId.
To relate the parent Entity with the data ingestion pipeline responsible for ingesting the data (FileIngestionDatabricks), the following script is necessary:
-------------------------------------------------------------------
-- ENTITYPIPELINE RELATIONSHIP
-------------------------------------------------------------------
DECLARE @EntityId INT = 2315
DECLARE @PipelineItemId UNIQUEIDENTIFIER = 'F8CD4AD9-DCAE-4463-9185-58292D14BE99' --always the same template ItemId
DECLARE @PipelineId INT = (SELECT TOP (1) p.[Id] FROM [DataIngestion].[Pipeline] p WHERE p.[ItemId] = @PipelineItemId)
INSERT INTO [DataIngestion].[EntityPipeline] ([IdEntity], [IdPipeline]) VALUES (@EntityId, @PipelineId)
Step 2.4. File registration in Sidra¶
Now, we will register the Excel file in the stage storage account of our example environment, where the landing zone is located: sdscoredevstastagene.
Note that the names of the storage accounts contain the words stage and dlake to clarify the purpose of each of them.
The following steps are required in order to perform the ingestion properly:
1. Rename the Excel file
First, we will rename the file so that it matches the RegularExpression created before:
AccountingJournal_2022-03-01_12-00-00.xlsx
After this, the file will be ready to go on with the ingestion process.
2. Deposit the file in the landing zone
We will go to the landing blob container in the stage storage account mentioned before (sdscoredevstastagene). There, we will follow the next convention:
/<name_provider>/<name_entity>/<year>/<month>/<day>/<file>.xlsx
In our example, we will create a folder with the name of the Provider (MyFirstExcelIngestion). Inside that folder, we will create a folder with the name of the Entity (AccountingJournalEntity). There, we will create three nested folders, each one inside the previous one, with the current date, starting with the year (2022), followed by the month (03) and the day (01). In the day folder, we will store the renamed Excel file AccountingJournal_2022-03-01_12-00-00.xlsx, as shown in the sketch below.
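As an alternative to uploading the file manually through the Azure Portal or Storage Explorer, the following is a minimal sketch using the Azure SDK for Python (azure-storage-blob). The connection string and the name of the landing container are assumptions for this example.
from azure.storage.blob import BlobServiceClient

# Placeholder: connection string of the stage storage account (sdscoredevstastagene)
conn_str = "<stage-storage-account-connection-string>"
container = "landing"   # assumed name of the landing zone container

# Blob path following the /<name_provider>/<name_entity>/<year>/<month>/<day>/<file>.xlsx convention
blob_path = ("MyFirstExcelIngestion/AccountingJournalEntity/2022/03/01/"
             "AccountingJournal_2022-03-01_12-00-00.xlsx")

service = BlobServiceClient.from_connection_string(conn_str)
blob = service.get_blob_client(container=container, blob=blob_path)

with open("AccountingJournal_2022-03-01_12-00-00.xlsx", "rb") as data:
    blob.upload_blob(data, overwrite=True)   # deposits the file in the landing zone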
After uploading the file to the stage storage account as described above, we should see in Data Factory a new pipeline called RegisterAsset, triggered by the RegularExpression and responsible for creating the Asset in the storage account specific to the DSU. This means that the created folders (including the Excel file inside them) will be automatically moved from the landing zone to a new Azure Storage container named after the Provider (in our example, myfirstexcelingestion) in the same stage storage account (sdscoredevstastagene), launched by the RegisterAsset pipeline. Note that the name of the Excel file will be changed, adding an Id number at the end of the name.
Once the files are in the landing zone, a file ingestion process is triggered every time there is a new drop of data in the landing zone.
Step 2.5. Infer the schema¶
Sidra supports ingestion of complex data in Excel files and, for this, it provides several helper tools to ease the process of configuring the metadata, such as which tables are present and where they are located in the Excel file. One of them is the SchemaInference script.
The next step is to use this script in the Databricks workspace to extract the schema of the Excel data file as a guide, and later check it manually or with the Validate schema method inside the same script.
# URL of the registered Excel file in the stage storage account (container named after the Provider)
dataFileDir = "https://sdscoredevstastagene.blob.core.windows.net/myfirstexcelingestion/AccountingJournalEntity/2022/03/01/accountingjournal_2022-03-01_12-00-00_id3951.xlsx"
# Sheet and cell range containing the data to ingest
dataSelection = "'Accounting Journal'!B4:N134"
# Infer the schema and display the resulting dataframe
df = infer_schema(dataFileDir, dataSelection)
display(df)
- Copy the URL of the Excel file accountingjournal_2022-03-01_12-00-00_id3951.xlsx inside the new container myfirstexcelingestion from the stage storage account and paste it as the path in the dataFileDir variable of the Databricks notebook.
- Select the cells of the Excel file with the information to ingest in the field dataSelection.
- Execute the first cell of the Python notebook in Databricks in order to initialize the method definitions, and run the SchemaInference method with the values dataFileDir and dataSelection changed as in the example above.
- We should then obtain the display of a dataframe generated from the Excel file data, together with the dataframe schema inferred by the method.
This inferred schema must always be checked manually by the user, given that the library could infer types that are not desired, for example inferring numeric values as strings.
Step 2.6. JSON configuration¶
The provided schema must be formatted as JSON with the following syntax and must populate the AdditionalProperties field of the created Entity.
-------------------------------------------------------------------
-- JSON CONFIGURATION
-------------------------------------------------------------------
UPDATE DataIngestion.Entity set AdditionalProperties = N'{
"TransferQuery": "Excel/ExcelDataIntake",
"entitySchema":
[
{
"identifier":"Accounting Journal.AccountingJournal_table",
"type": "table",
"range": "B4:N4",
"schema": {
"Code":"double",
"_c1":"string",
"Balance Sheet":"string",
"Debit3":"double",
"Credit4":"double",
"Debit5":"double",
"Credit6":"double",
"Debit7":"double",
"Credit8":"double",
"Debit9":"double",
"Credit10":"double",
"Debit11":"double",
"Credit12":"double"
}
}
]
}' where Id = 2315 --example
-------------------------------------------------------------------
-- ENTITY INFORMATION CHECKING
-------------------------------------------------------------------
SELECT * FROM DataIngestion.Entity ORDER BY Id DESC
Where:
- Each key-value pair defining the schema represents a section of the data within the sheet of the Excel file.
- The identifier will be the name Accounting Journal.AccountingJournal_table, where Accounting Journal is the name of the sheet and AccountingJournal_table is the name used later for the table with the data from the Excel file.
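To avoid copying the schema by hand, the dataframe inferred in Step 2.5 can be turned into this JSON shape programmatically. The sketch below is only an assumed helper (schema_to_additional_properties is not part of Sidra); it relies on the Spark type names (double, string, ...) matching the ones expected in the entitySchema, so the result should still be reviewed manually.
import json

def schema_to_additional_properties(df, identifier, header_range):
    # Hypothetical helper: build the AdditionalProperties JSON from the inferred dataframe
    columns = {field.name: field.dataType.simpleString() for field in df.schema.fields}
    return json.dumps({
        "TransferQuery": "Excel/ExcelDataIntake",
        "entitySchema": [
            {
                "identifier": identifier,
                "type": "table",
                "range": header_range,
                "schema": columns,
            }
        ],
    }, indent=4)

# Using the dataframe inferred in Step 2.5
print(schema_to_additional_properties(df, "Accounting Journal.AccountingJournal_table", "B4:N4"))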
Step 3. Child Entities configuration¶
The ViewsDefinition script, once executed manually by the user, is responsible for inserting all the views that result from the configuration of a parent Entity. This means adding entries for the new Entities in the Sidra Core metadata database, as well as adding the Attributes for such Entities (also in the Sidra Core metadata database).
We will use the ViewsDefinition script, configuring it with the additionalProperties values and the following:
parentEntityName = 'AccountingJournalEntity'
parentEntityTable = 'AccountingJournalEntity'
parentEntityItemId = '34AE0551-A452-43C9-B040-E7356F57F203' # ItemId on Entity table
providerItemId = '657498AF-4732-4969-9FF4-78F07E1A127A'
providerDatabaseName = 'myfirstexcelingestion'
storageName = 'sdscoredevstadlakene' #DSU
This script will produce as a result an SQL script representing the ViewsDefinition of the data and the child Entity related to it. After executing that SQL script, the view structure of the Excel file will be defined, which will be passed as metadata Attributes in order to visualize the data as a table in Databricks.
Step 4. Ingestion of data in Databricks¶
- Execute the CreateTable pipeline in Data Factory.
- In Databricks, the table with the data from the Excel file will have been created. To make sure, check the Data section of Databricks in the database defined before (in our case, myfirstexcelingestion).
- Now, execute the FileIngestionDatabricks pipeline in Data Factory. This pipeline has the following parameters:
    - assetDate: the execution date. Example: 3/1/2022 12:00:00 AM.
    - assetReferences: the path with the assetId that we ingested previously in the container of the storage account. The information in this path will be split by the ExcelDataIntake.py DSU ingestion script to generate all the information. Example: 3951,wasbs://[email protected]/AccountingJournalEntity/2022/03/01/accountingjournal_2022-03-01_12-00-00_id3951.xlsx.
    - input: the same path without the Id and the file name. Example: wasbs://[email protected]/AccountingJournalEntity/2022/03/01.
    - assetId. Example: 3951.
    - tableName: the name of the Entity. Example: AccountingJournalEntity.
    - providerName. Example: MyFirstExcelIngestion.
    - providerId. Example: 1087.
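If you prefer to trigger the run programmatically instead of from the Data Factory portal, a minimal sketch with the Azure SDK for Python (azure-identity and azure-mgmt-datafactory) could look like the following. The subscription, resource group and factory names are placeholders, and the parameter values simply mirror the examples above.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholders for the environment hosting the Sidra Data Factory
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Trigger FileIngestionDatabricks with the parameters described above
run = adf.pipelines.create_run(
    resource_group,
    factory_name,
    "FileIngestionDatabricks",   # pipeline name as shown in Data Factory
    parameters={
        "assetDate": "3/1/2022 12:00:00 AM",
        "assetReferences": "3951,wasbs://[email protected]/AccountingJournalEntity/2022/03/01/accountingjournal_2022-03-01_12-00-00_id3951.xlsx",
        "input": "wasbs://[email protected]/AccountingJournalEntity/2022/03/01",
        "assetId": 3951,
        "tableName": "AccountingJournalEntity",
        "providerName": "MyFirstExcelIngestion",
        "providerId": 1087,
    },
)
print(run.run_id)   # track this run in the Data Factory monitor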
With the execution of this pipeline:
- The ExcelDataIntake.py DSU ingestion script will read, thanks to the Spark libraries, all the content of the table and will create, from this metadata, the JSON file with all the information from the file.
- The JSON file will be uploaded into the DSU storage account, in the container with the name of the Provider.
- Afterwards, the ExcelDataIntake.py DSU ingestion script will create a view inside the database with the view definition configured for the child Entities.
- There will now be a delta table inside Databricks, named MyFirstExcelIngestion.AccountingJournalEntity_accountingjournal_table, with all the data coming from the JSON file.
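To verify the result, the new delta table can be queried from a Databricks notebook; a minimal check could be:
# Query the resulting delta table from a Databricks notebook (the spark session is provided by Databricks)
df = spark.sql("SELECT * FROM MyFirstExcelIngestion.AccountingJournalEntity_accountingjournal_table")
display(df)
print(df.count())   # number of rows ingested from the Excel file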