
How CSV data intake works

This tutorial applies to intakes of the Landing Zone type and is an illustrative guide. The code in the scripts may receive frequent minor changes; use these scripts as starting points to build your own. The scripts were tested on Sidra 2025.5. For more information, you may contact Sidra's Support Team.

In this tutorial we illustrate how to configure a Sidra Entity to ingest data from CSV files into the platform.

Introduction

For a data intake process to take place, Sidra needs the metadata - the definition of each Entity and its Attributes - corresponding to the CSV table and its columns.

As opposed to database Providers, there is no automatic metadata inference pipeline available for CSV files. This means that the metadata must be specified manually for each Entity (table) to be ingested, as well as for its Attributes (columns).

Defining the metadata, as an overview, refers to:

  1. Ensure we have a Provider for the various CSV files to be ingested.
  2. Define an Entity for each type of CSV to be ingested.
  3. For each Entity (type of CSV), define the Attributes corresponding to the columns/fields present in that CSV.
    • Note that there are also some system-required Attributes, needed for specifying how to read and process the columns/fields in Databricks tables.
  4. Associate the Entity with the ingestion Pipeline for data files.

Declaring the metadata can be carried out either by calling the Sidra API or by executing a SQL script directly on the Sidra Service metadata database. The sample PowerShell script below calls the Sidra API and can be used as a starting point for defining CSV-ingesting Entities.

It is recommended to declare the metadata for the Entities (CSVs) to be ingested using the Sidra Service API calls, since the API performs additional validations. By contrast, adding the metadata with a SQL script does not ensure input validation; the Transact-SQL snippets in this article are shown only as illustration.

Defining the metadata

Sidra metadata tables do not allow space characters in the names of Attributes or Entities. The metadata API will sanitize space characters if they are provided in Entity or Attribute names, replacing them with the _ character. This sanitization is performed only by the API; SQL scripts will not do it.

There are rules on configuring the metadata. The sample scripts in this article - for PowerShell-based API calls - may be used, but you should be aware of the restrictions, limitations, and settings affecting the ingestion.

Defining the Provider

The Provider should be regarded as the data source, such as the database or the system generating the CSV files. The main fields of a Provider definition are:

  • The [ProviderName] is the human-friendly identifier, no longer than 30 characters. Keep in mind that this field only takes unique values; two Providers cannot have the same name.
  • The [DatabaseName] will identify the target Databricks destination DB and the Azure Storage container for Assets. Hence, observe the naming limitations as per Azure's documentation.
  • The DSU identifier or the Azure DSU Resource Group where data is ingested into Databricks. Most Sidra deployments use only one DSU. The DSU records are stored in the Core DB, in the [DataIngestion].[DataStorageUnit] table.

See the Provider table definition for details.
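As a point of reference, the Provider fields above can be assembled into the request body for the /api/metadata/providers endpoint. The PowerShell sketch below is illustrative only; the exact property names and the required fields should be verified against your installation's API Swagger definition.

# Illustrative sketch: property names are assumptions, verify them in the Sidra API Swagger.
$provider = @{
    ProviderName      = 'MyFirstCsvIngestion'   # human-friendly, unique, max 30 characters
    DatabaseName      = 'myfirstcsvingestion'   # drives the Databricks DB and the Assets storage container
    IdDataStorageUnit = 1                       # DSU id, see [DataIngestion].[DataStorageUnit]
    Description       = 'CSV files produced by the Bike-Store system'
}
$provider | ConvertTo-Json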

Defining the Entity

It helps to think of the Entity as a database table; each Entity will have an associated Databricks table. The Entity metadata tells Sidra how the data is to be ingested into that table. Things to observe:

  • The [TableName] field of the Entity will be used to identify the Databricks table; hence, restrictions apply.
    Most notably, the space ' ' character must not be used. In fact, the Java libraries in Databricks will reject table names containing " ,;{}()\n\t".
  • The [StartValidDate] and [EndValidDate] also act as filters: CSV blobs for Assets whose AssetDate does not fall within this interval will not be ingested.
    The AssetDate, which is considered for the interval, will be deduced from the CSV blob name or path. The options are:
    • Blob path contains /<year>/<month>/<date>/ segments; example blob name under the landing container:
      landing/<providerName>/<entityName>/YYYY/MM/DD/Bike-Store.Bikes.csv
    • The file name should indicate the date, and a matching RegEx should extract it; see below.
  • The [RegularExpression] acts as a filter, a mapper, and, optionally, an AssetDate extractor (see the local test sketch after this list).
    • Only CSV blobs with names matching the RegEx will be associated with this Entity and its Pipeline.
    • Here is an example of a RegEx able to extract AssetDate from the blob's file name:
      ^Bike-Store.Bikes_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2}))\.csv$
      matching blobs named like:
      landing/<providerName>/<entityName>/Bike-Store.Bikes_YYYY-MM-DD_hh-mm-ss.csv
    • CSV blobs not matching any Entity RegEx will be rejected as noise artifacts.
      Rejections will be logged as errors in the [Log].[Util].[Log] database table.
  • A [Field Delimiter] is relevant for how the CSVs will be parsed:
    • The default is the comma ,. Although CSV stands for Comma-Separated Values, the delimiter may also be a semicolon ;, a TAB \t, a pipe |, an ampersand &, a caret ^, or another character.
    • The columns [FieldDelimiter], [HeaderLines], [Encoding], and [RowDelimiter] of [DataIngestion].[Entity] are deprecated and are disregarded by the new ingestion module.
    • A field delimiter can be specified in a readerOptions object within the [AdditionalProperties] column, which expects JSON. If the [AdditionalProperties] JSON includes a readerOptions object, that object is passed as Data Source Options to the Spark CSV parser that Databricks uses for ingesting CSV files.
    • A Unicode escape of the delimiter may also be used, like \u0009 for TAB, \u002C for comma, or others.
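Because the RegEx controls both the association with the Entity and the AssetDate extraction, it is worth testing it locally before registering it. The PowerShell sketch below uses the example expression above against a hypothetical file name; it is just a local check and not part of the ingestion itself.

# Local check of the Entity RegEx against a candidate file name (illustrative only).
$regex    = '^Bike-Store.Bikes_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2}))\.csv$'
$fileName = 'Bike-Store.Bikes_2022-01-28_16-30-21.csv'
if ($fileName -match $regex) {
    # The named groups are what yields the AssetDate during registration.
    '{0}-{1}-{2}' -f $Matches['year'], $Matches['month'], $Matches['day']   # 2022-01-28
} else {
    'This file name would be rejected as a noise artifact for this Entity.'
}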

Since release 2022.R3 (1.13.0), the [FieldDelimiter], [HeaderLines], [Encoding], and [RowDelimiter] columns are disregarded by the new ingestion module. A readerOptions object can be passed in the [AdditionalProperties] field of the Entity; it configures how the CSV file is parsed by the Spark library.
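For orientation, here is a hedged sketch of such an [AdditionalProperties] value for a TAB-delimited CSV with one header line. The readerOptions keys shown (delimiter, header, encoding) mirror common Spark CSV data source options; treat the exact option names, and which of them your Sidra version honors, as assumptions to be confirmed against the Spark documentation and the Entity table definition.

# Illustrative [AdditionalProperties] JSON; option names are assumptions based on the Spark CSV reader.
$additionalProperties = @'
{
  "readerOptions": {
    "delimiter": "\u0009",
    "header": "true",
    "encoding": "UTF-8"
  }
}
'@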

See the Entity table definition for details.

Defining the Attributes

Just as an Entity describes a table, the Attributes describe the table columns, or the record fields. A few things to observe:

  • Don't use the space ' ' character in the [Name]. The Java libraries in Databricks will reject names containing " ,;{}()\n\t".
  • The [HiveType], [SqlType], and [SourceType] describe what kind of data the column has, and how it should be treated by Databricks or any SQL DB used in Data Products.
  • All Entities must also define 5 additional system-required metadata Attributes, used in the Data Lake. These are created by the sample script below. They should come last with respect to the [Order] field of the Attributes:
    • The SidraIsDeleted Attribute;
    • The SidraPassedValidation Attribute;
    • The SidraLoadDate Attribute;
    • The SidraIdAsset Attribute;
    • The SidraFileDate Attribute.
  • If records are expected to be updated by subsequent CSV ingestions for the Entity, one of the columns in the CSV should be a primary key. Its corresponding Attribute should have [IsPrimaryKey] set to 1.

See the Attribute table definition for details.
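To make the shape of the Attribute metadata concrete, here is a hedged sketch of how a couple of CSV columns could be described before being sent to the /api/metadata/attributes endpoint. The property names and type values are assumptions based on the fields discussed above; the helper script below and the Attribute table definition remain the authoritative references.

# Illustrative Attribute shapes only; verify the exact property names against the Sidra API Swagger.
$attributes = @(
    @{ Name = 'BikeId';   HiveType = 'INT';    SqlType = 'INT';           IsPrimaryKey = $true;  Order = 1 }
    @{ Name = 'BikeName'; HiveType = 'STRING'; SqlType = 'NVARCHAR(100)'; IsPrimaryKey = $false; Order = 2 }
    # ...the remaining CSV columns...
    # The 5 system-required Attributes (SidraIsDeleted, SidraPassedValidation,
    # SidraLoadDate, SidraIdAsset, SidraFileDate) must come last in the [Order].
)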

Pipeline and Stage landing

The defined Entity should be processed by the FileIngestionDatabricks Pipeline, so a mapping record must exist in the [DataIngestion].[EntityPipeline] table. This pipeline is responsible for:

  1. Registering a new Asset for each new CSV file to be ingested for the Entity. Once registered, the Asset's CSV blob is moved under a Stage container with the name of the Provider.
  2. Creating (or re-creating) the target Databricks table where records will be ingested, if that table does not exist.
  3. Running the DSU ingestion script to bring data from the registered Asset (CSV blob) into the Databricks Entity's table.

The CSV files should be dropped under the landing container of the DSU Stage Azure Storage Account - the account containing stage in its name. Additionally:

  • These file blobs should be placed under a folder named like /<provider-name>/<entity-name>/;
  • The file names should match the [RegularExpression] set for the Entity;
  • The full name of the resulting blob should indicate the Asset date:
    • Either its path should contain /<year>/<month>/<date>/ segments after /<provider-name>/<entity-name>/;
    • Or its file name should indicate the date, and the Entity's RegEx should be built to extract it.
      Here is an example of a RegEx able to extract AssetDate from the blob's file name:
      ^Bike-Store.Bikes_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2}))\.csv$
      matching blobs named like:
      landing/<providerName>/<entityName>/Bike-Store.Bikes_YYYY-MM-DD_hh-mm-ss.csv

If everything is correct, the following sequence should be observed:

  1. The CSV file is dropped as a blob in the Staging Azure Storage account, under the landing container, with the indicated path and naming.
  2. This triggers the RegisterAsset Pipeline in the Data Factory. With this pipeline run, the blob is registered as an Asset: the blob is associated to the Entity using the blob path /<provider-name>/<entity-name>/ and the matching RegEx of the Entity.
  3. Upon completion, the RegisterAsset Pipeline starts a FileIngestionDatabricks Pipeline run. This Pipeline invokes the Databricks notebook ingesting the data into the DSU Data Lake.
  4. At the end of the FileIngestionDatabricks Pipeline run, the records from the CSV file will be visible in the Entity's Databricks table.

The sample script, referenced below in this guide, illustrates the overall process needed: adding a new Provider, adding a new Entity, adding new Attributes, associating the Entity with a pipeline, and checking Assets metadata.

Configuration through API calls

The following calls to the Sidra Service API can be performed through Swagger or PowerShell to declare the metadata as per the instructions above (a hedged call sketch follows the list):

  • Endpoint /api/metadata/providers: used to create the Provider.
  • Endpoint /api/metadata/entities: used to create the Entities.
  • Endpoint /api/metadata/attributes: used to create the Attributes associated with an Entity.
  • Endpoint /api/metadata/entities/{idEntity}/pipelines/{idPipeline}: used to associate an Entity with a pipeline.
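A hedged sketch of one such call is shown below, using Invoke-RestMethod against the first endpoint. The API base URL, the way the bearer token is obtained, and the exact body schema are assumptions that vary per installation; the helper script in the next section wraps all of this for you.

# Illustrative only: base URL, token acquisition, and body schema depend on your installation.
$apiBaseUrl  = 'https://<your-sidra-core-api-host>'           # hypothetical placeholder
$accessToken = '<bearer token acquired for the Sidra API>'    # obtain it as per your identity setup
$headers     = @{ Authorization = "Bearer $accessToken" }

$body = $provider | ConvertTo-Json    # e.g. the Provider hashtable sketched earlier
Invoke-RestMethod -Method Post -Uri "$apiBaseUrl/api/metadata/providers" `
                  -Headers $headers -ContentType 'application/json' -Body $body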

Helper PowerShell script

To help automate defining the metadata, you can use the following PowerShell script and its associated template input file:

Find inside:

  • The Define-CSV-Entity.ps1 PowerShell helper script.
  • Sample Define-CSV-Entity-*.psd1 files showing how to fill in the required parameters and metadata.
  • Sample CSV files with data from the Bike-Stores SQL Server sample DB.
  • An instructional README about the PowerShell script and its usage.
  • An alternative version of the script, for engineers having restricted access to Sidra's Azure resources.

Usage example:

PS D:\Sidra.SupportTools> .\Define-CSV-Entity.ps1 -ParamsFile "D:\Temp\Define-CSV-Entity-Bikes.psd1";

Script behavior

The DSU, with its implicit Data Factory entry, will be selectable at runtime if more than one is present within the Sidra installation.

The Provider and Entity records are upserted:

  • If they don't exist in the [Core] DB, they are created;
  • If they're already present in the [Core] DB, they're updated.

If an Attribute collection already exists for an existing Entity in the [Core] DB, the collection may be:

  • Re-created from scratch - all Attribute records are deleted and then re-created; or
  • Updated - each Attribute in the input is upserted into the existing collection.

The corresponding option will be selectable at runtime, if applicable.

Walkthrough: Sample load from a CSV file

Follow the instructions below for a first test-drive CSV ingestion with Sidra:

A. Prepare setup

On a machine with PowerShell 7 and the Azure PowerShell module installed, download the define-csv-entity.zip. Let's say we're extracting the content into D:\Sidra.SupportTools.

With this walkthrough, we intend to ingest into Sidra the Bike-Store.Bikes_2022-01-28_16-30-21.csv file:
D:\Sidra.SupportTools\Sample-CSV-files\Bike-Store.Bikes_2022-01-28_16-30-21.csv

B. Define ingestion metadata

Locate the file Define-CSV-Entity-more-metadata.psd1 and copy it as D:\Temp\Define-CSV-Entity-Bikes.psd1. Note that a PSD1 file is like a "JSON for PowerShell". Edit the copied file as follows (a trimmed sketch of the PSD1 appears after this list):

  1. Populate the SidraCore section with the TenantID, SubscriptionID, and ResourceGroupName.
  2. In the Metadata, focus on the Provider.Name and Entity.Name and edit these fields.
    These values will determine the path for dropping the CSV file as an Azure Storage blob.
  3. In the Metadata, review and edit the Entity.FieldDelimiter:
    The sample CSV file we're loading uses the TAB character as the field delimiter.
  4. In the Metadata, review and edit the Entity.RegularExpression:
    • Notice that the existing sample matches the CSV file name; and
    • The RegEx is built to recognize the time-stamp in the file name: it will extract an AssetDate from the file name.
  5. Review and edit the other Metadata fields containing POPULATE-THIS.
  6. Notice that the Attributes section describes the columns present in the CSV file being loaded.
    No changes should be needed here.
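For orientation, here is a heavily trimmed sketch of what the edited PSD1 might look like. The exact key names and structure come from the template shipped in the ZIP (Define-CSV-Entity-more-metadata.psd1); treat the snippet as an assumption and always start from that template rather than from this sketch.

# Trimmed, illustrative PSD1 sketch; the shipped template is the authoritative structure.
@{
    SidraCore = @{
        TenantID          = '00000000-0000-0000-0000-000000000000'
        SubscriptionID    = '00000000-0000-0000-0000-000000000000'
        ResourceGroupName = 'POPULATE-THIS'
    }
    Metadata = @{
        Provider = @{ Name = 'MyFirstCsvIngestion' }
        Entity   = @{
            Name              = 'BikeStoreBikes'
            FieldDelimiter    = '\u0009'   # TAB, matching the sample CSV
            RegularExpression = '^Bike-Store.Bikes_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2}))\.csv$'
        }
        # Attributes = @( ...one entry per CSV column... )
    }
}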

With the PSD1 file saved, we can now execute the script:

PS D:\Sidra.SupportTools> .\Define-CSV-Entity.ps1 -ParamsFile "D:\Temp\Define-CSV-Entity-Bikes.psd1";

CSV tutorial, Sample metadata

Once the script has executed successfully, we may proceed to dropping the CSV file in the landing location.

C. Deposit the file in the landing zone

  1. Locate the landing zone:
    a. Find Sidra's target DSU Azure resource group.
    b. Inside it, find the Storage Account whose name contains stage: that's the staging storage account.
    c. You may need to whitelist, within its firewall, the IP address of the client machine dropping the CSV file there.
    d. Within the stage (staging) storage account, locate the container named landing.
  2. Create a virtual folder /<providerName>/<entityName>/ under the landing container.
    Use the values Provider.Name and Entity.Name from the PSD1 file, from under the Metadata.
  3. Drop the CSV file under the virtual folder /<providerName>/<entityName>/ (an upload sketch follows this list). We should end up with a blob having the full path:
    /landing/<providerName>/<entityName>/Bike-Store.Bikes_YYYY-MM-DD_hh-mm-ss.csv
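If you prefer to upload from PowerShell rather than through the Azure Portal, a sketch using the Az.Storage module is shown below. The storage account name is a hypothetical placeholder, and your firewall and authentication setup may require a different storage context; the Portal's upload blade works equally well.

# Hedged sketch: upload the sample CSV into the landing container with Az.Storage.
# 'mydsustagestorage' is a hypothetical name; use your DSU's staging (stage) account.
$ctx = New-AzStorageContext -StorageAccountName 'mydsustagestorage' -UseConnectedAccount
Set-AzStorageBlobContent -Context $ctx -Container 'landing' `
    -File 'D:\Sidra.SupportTools\Sample-CSV-files\Bike-Store.Bikes_2022-01-28_16-30-21.csv' `
    -Blob '<providerName>/<entityName>/Bike-Store.Bikes_2022-01-28_16-30-21.csv'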

CSV tutorial, Landing container

Note:
We're not using the format /<providerName>/<entityName>/<year>/<month>/<day>/<file>.csv, because the Entity.RegularExpression is already able to extract an AssetDate from the blob's file name. This format can be used, as an alternative, when the file name does not contain a time-stamp and/or the Entity's regular expression is not built to extract a time-stamp from the file name.

Note:
The dropped CSV file blob will disappear shortly. This is by design: if everything is correctly configured, the blob is moved on by the ingestion pipeline.

D. Follow the data flow

If everything was properly configured, the following steps will ingest the CSV data into the Data Lake. They can be observed by opening the Data Factory studio and the Databricks workspace of the resources pertaining to the target DSU where the file is being ingested.

  1. Asset registration
    • Placing a blob into the landing zone triggers the Data Factory RegisterAsset pipeline.
    • This pipeline associates the landed blob with a certain Provider and Entity in Sidra.
    • Asset registration uses the Entity's RegularExpression to associate the blob with a specific Entity under a Provider.
    • An AssetDate is determined during the asset registration.
    • The blob is moved from where it was dropped to the Provider's container inside the staging storage account.
    • With this, the file blob becomes an Asset, allowing further processing.
  2. File ingestion pipeline
    • Upon successful registration, the RegisterAsset pipeline invokes the pipeline that is associated with the Entity.
    • Hence, the FileIngestionDatabricks pipeline is executed next.
    • This second pipeline invokes Databricks notebook job runs, which finish the ingestion.
  3. Databricks notebook runs
    • As the last phase, the following Databricks notebooks are executed for the Asset:
      /Sidra/DSUIngestion/CreateTables
      /Sidra/DSUIngestion/TabularIngestion
    • The target Databricks table definition will use the metadata defined in the PSD1 file.
      Its columns are based on the Attributes section from the PSD1.
    • The CSV file is parsed, and the records are copied into the target Databricks table.
    • This table will contain the CSV information in Delta format.
    • The data will be stored in Databricks in an incremental way: each new CSV file that corresponds to this Entity will append new records to the corresponding table in Databricks.
  4. Ingestion finished
    • Once the Data Factory pipelines have finished execution, the Asset is marked as successfully ingested.
    • Check Sidra's web management UI: select the DSU to see the Provider, Entity, and the newly loaded Asset.
    • Side note: the Asset will have [IdStatus]=2 in the [Core] DB, table [DataIngestion].[Asset] (a query sketch follows this list).
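If you also have access to the Sidra Service metadata database, the Asset status can be checked with a quick query. The sketch below assumes the SqlServer PowerShell module and a reachable Core database; the server and database names are placeholders, the [Id] ordering column is an assumption, and the authentication parameters depend on your setup.

# Hedged sketch: inspect recent Assets in the Core DB (requires the SqlServer module).
# Look for [IdStatus] = 2, which indicates a successfully ingested Asset.
Invoke-Sqlcmd -ServerInstance '<sidra-core-sql-server>' -Database '<sidra-core-db>' -Query @'
SELECT TOP (10) *
FROM [DataIngestion].[Asset]
ORDER BY [Id] DESC;   -- [Id] assumed to be the identity column
'@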

CSV tutorial, Data Factory studio

Here we can see an example of what could be visualized in Databricks after loading two CSV files corresponding to the same configured Entity:

  • Provider: MyFirstCsvIngestion.
  • Entity: BikeStoreBikes.
  • You may download define-csv-entity.zip for a test; find inside the sample CSV files.
  • Databricks table:

CSV tutorial, Databricks table