AdditionalProperties Configuration

This article describes the different configurations available for the AdditionalProperties column in the Entity table.

AdditionalProperties column

Some of the options of this column are detailed below.

1. readerOptions field

This field is used to read the source/raw file. In the case of database sources, the process generates a parquet file. This property is optional and its value is a JSON object defined as a string. Its structure and its default values can vary depending on the type of file to read (.parquet, .csv):

  • For .csv, .xlsx, and .xlsm files, the options follow the Apache Spark documentation. For example, declaring a "header" parameter would look like:

        {"header": true}
    
    If a parameter is declared that is not among the reading options of that format, the file will be loaded ignoring that property, without raising any error.
  • For .xls files, the options follow the pandas documentation:

        {"header": 0}
    
    If a parameter is declared that is not among the reading options of that format, the file will not be loaded and an error will be raised.
  • For .parquet files, the Spark reader is used with its supported options.
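
As an illustrative sketch (not Sidra's actual implementation), the readerOptions JSON string can be deserialized and passed entry by entry to the Spark reader; the file path and session setup here are hypothetical:

    import json
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # readerOptions as stored in AdditionalProperties: a JSON object defined as a string
    reader_options = '{"header": true, "delimiter": ";", "encoding": "UTF-8"}'

    # Deserialize and hand each entry to the Spark CSV reader;
    # options not defined for this format are silently ignored by Spark
    options = json.loads(reader_options)
    df = spark.read.format("csv").options(**options).load("/mnt/landing/countries.csv")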

Separator value

Regarding the separator value in CSV files, "," is frequently used by default and, for the line break, "\r", "\r\n", and "\n". Some examples are shown below.

Case 1
  • Content file:
    Name;Code
    "Congo;
        the Democratic Republic of the";CD
    Réunion;RE
    
  • readerOptions definition:
    {
        "delimiter": ";", 
        "encoding": "UTF-8", 
        "header": true, 
        "multiLine": true
    }
    
Case 2
  • Content file:
    Name;Code¦"Congo;
        the Democratic Republic of the";CD¦Réunion;RE
    
  • readerOptions definition:
    {
        "delimiter": ";", 
        "encoding": "UTF-8", 
        "header": true, 
        "multiLine": true,
        "lineSep":"¦"
    }
    

Versions

From version 2022.R3, the Serde, SerdeProperties, RowDelimiter, FieldDelimiter, and Encoding columns are configurable as fields within the readerOptions field of the AdditionalProperties column. These columns still remain in the table for backward compatibility with previous versions.
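
A hypothetical readerOptions value carrying these fields could look as follows; the exact field names and casing inside readerOptions are an assumption here, not confirmed by this article:

    {
        "header": true,
        "serde": "csv",
        "serdeProperties": "\"separatorChar\" = \"\u0001\"",
        "rowDelimiter": "\n",
        "fieldDelimiter": ";",
        "encoding": "UTF-8"
    }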

Serde value

Here the source format of the file is specified. The value is often "csv", but it can be any of the allowed values for the spark.read.format() function which, as of today, are: json, parquet, jdbc, orc, libsvm, csv, text.

SerdeProperties value

Depending on which source format the system ingests, the SerdeProperties may vary. To provide more context about this column, its content is passed to the options function in Databricks. For instance, using the following configuration: Serde = 'csv', SerdeProperties = '"separatorChar" = "\u0001","quoteChar" = "\u0003","escapeChar" = "\u0002"', the outcome in Databricks Spark will be something like: spark.read.format('csv').options("separatorChar" = "\u0001", "quoteChar" = "\u0003", "escapeChar" = "\u0002").
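
As a minimal sketch of how such a value could be turned into reader options (the parsing helper and file path are hypothetical, not Sidra's actual code):

    import re
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Serde and SerdeProperties as stored in the metadata;
    # the stored value contains the escaped text \u0001, kept verbatim here
    serde = "csv"
    serde_properties = '"separatorChar" = "\\u0001","quoteChar" = "\\u0003","escapeChar" = "\\u0002"'

    # Extract the "key" = "value" pairs into a dict and pass them to options()
    options = dict(re.findall(r'"([^"]+)"\s*=\s*"([^"]*)"', serde_properties))
    df = spark.read.format(serde).options(**options).load("/mnt/raw/file.csv")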

2. consolidationMode field

It can take the values Merge and Snapshot. When selecting:

  • Merge, the data will be merged.
  • Snapshot, default value, the data will be appended.

Info

To configure the consolidationMode field in the additionalProperties column of the Entity table, use the Sidra API: first get the Entity Id, then update the Entity through the update endpoint (/api/metadata/entities/{id}). More information about the endpoints can be checked in the documentation site.

Do not confuse this with the consolidation mode options on Data Products, which are configurable through the pipelineExecutionProperties column.
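
A minimal sketch of that update flow, assuming a Python client; the base URL, authentication, and exact payload shape are assumptions, only the endpoint path comes from this article:

    import json
    import requests

    base_url = "https://core.mysidra.com"   # hypothetical Sidra API host
    entity_id = 42                          # hypothetical Entity Id
    headers = {"Authorization": "Bearer <token>"}

    # 1. Get the Entity through the metadata API
    entity = requests.get(f"{base_url}/api/metadata/entities/{entity_id}",
                          headers=headers).json()

    # 2. Edit consolidationMode inside the additionalProperties JSON string
    props = json.loads(entity.get("additionalProperties") or "{}")
    props["consolidationMode"] = "Merge"
    entity["additionalProperties"] = json.dumps(props)

    # 3. Update the Entity through the update endpoint
    requests.put(f"{base_url}/api/metadata/entities/{entity_id}",
                 headers=headers, json=entity)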

3. dataPreview field

The values for this option are true and false. When true, it generates the data preview for the Data Catalog in the Sidra Website.

4. piiDetectionEnabled field

This field enables PII detection in the Assets of the Entity as part of the data load.

More information about PII detection can be checked here.

5. language field

Indicates the language of the data. The values for this option are: en and es. This option is used for PII detection.
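
For reference, a hypothetical additionalProperties value combining the consolidationMode, dataPreview, piiDetectionEnabled, and language fields could look like this (values are illustrative):

    {
        "consolidationMode": "Merge",
        "dataPreview": true,
        "piiDetectionEnabled": true,
        "language": "en"
    }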

6. isEncrypted field

The default value of this field is false.

Versions

From 2022.R3 onwards, this field will not be used.

7. Database origin fields

These fields can change depending on the database or origin type as follows (a combined example for a SQL Server origin is sketched after this list):

  • sourceTableName: Full name of the origin table in the origin-specific format. For example, the formats used for qualified table names are:
    • SQL Server: [TABLE_CATALOG].[TABLE_SCHEMA].[TABLE_NAME]
    • DB2: "TABLE_SCHEMA"."TABLE_NAME"
    • MYSQL: TABLE_SCHEMA.TABLE_NAME
  • sourceServer: Name of the source SQL server.
  • sourceDatabase: Name of the source database.
  • sourceSchema: Name of the origin schema.
  • sourceTable: Name of the origin table.
  • GoogleAnalytics. For example:

    {
        "ViewId": "271455630",
        "RequestData": {
            "dimensions": [
                { "name": "ga:date" },
                { "name": "ga:dimension3" },
                { "name": "ga:dimension2" },
                { "name": "ga:pagePath" }
            ],
            "metrics": [
                { "expression": "ga:hits" }
            ]
        }
    }
    

    Where:

    • ViewId: Google Analytics View ID in the property.
    • RequestData: Google Analytics dimensions/metrics request. It will be used for creating the metadata and also by the data intake pipeline.

8. TransferQuery field

When defined, it indicates the notebook path used for loading the data in the DSU. If it is not declared, the default notebook, the DSU ingestion one, will be used. Please refer to the use cases below.

9. anomalyDetection field

The anomaly detection property is set to true by default. This means that it will always be executed inside the DSUIngestion notebook, with the requirement of having at least 10 Assets per Entity. Otherwise, the following message will be shown:

Anomaly Detection skipped: Too few Assets.
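
Assuming the field accepts a boolean like the other flags in this article (an assumption, since only the default is documented), anomaly detection could be disabled with:

    {
        "anomalyDetection": false
    }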

10. entitySchema field

This field is used for the ingestion of Excel files that are not structured in a tabular format. It indicates how to read the Excel file, selecting the correct cells to ingest or a range of them.

This version of Sidra does not support Excel ingestion, although this feature will be implemented in future versions.

11. SearchServices field

For example:

{
    "SearchServices": {
        "FilePathColumn": "Path",
        "ProjectionsColumn": "Projections",
        "IndexNameColumn": "IndexName",
        "FilterExpression": "Name eq 'Factura.xlsx'",
        "OffsetToConvertUTCToSharePointTimeZone": "1"
    }
}

Where:
  • FilePathColumn, ProjectionsColumn, and IndexNameColumn define Databricks table column mappings.
  • FilterExpression and OffsetToConvertUTCToSharePointTimeZone are specific to the SharePoint connector (e.g. FilterExpression is used to filter files coming from selected SharePoint folders).

12. useLegacyTransferQuery field

If this field is set to true, the data loads which use the RegisterAsset and FileIngestionDatabricks pipelines (Data Intake Process via landing zone) will use the legacy Transfer Query. If false or not declared, the new DSU Ingestion script will be used.

Note that if the TransferQuery field is defined, the notebook indicated by that path will be used instead.

Use cases for TransferQuery and useLegacyTransferQuery fields

  • For using a legacy Transfer Query, note that it must already exist:
    {
        "UseLegacyTransferQuery": "true"
    }
  • For using a custom ingestion script, take into account that the path must be the absolute path:
    {
        "TransferQuery": "/folder/{ProviderName}folder/{TableName}"
    }