
AdditionalProperties Configuration

This article describes the different configuration options for the AdditionalProperties column in the Entity table.

AdditionalProperties column

Some of the options of this column are detailed below.

1. readerOptions field

  ≥ 2022.R3 

This field is used to read the source/raw file; in the case of database sources, the process generates a parquet file. This property is optional and its format is a JSON object defined as a string. Its structure and default values can vary depending on the type of file to read (.parquet, .csv, Excel -.xlsm or .xlsx-, Excel -.xls-):

  • For .csv, .xlsx, and .xlsm files, the options follow the Apache Spark documentation. For instance, declaring a "header" parameter would look like:

        {"header": true}
    
    If a parameter is declared that is not defined among the reading options of that format, the file will be loaded ignoring that property and no error will be raised.

  • For .xls files, the options follow the Pandas documentation. For instance:

        {"header": 0}
    
    If a parameter is declared that is not defined among the reading options of that format, the file will not be loaded and an error will be raised.

  • For .parquet files, the Spark reader is used according to the options it supports.
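
A minimal sketch of how a readerOptions definition for a semicolon-separated CSV file with a header row might be embedded in the AdditionalProperties JSON (the surrounding structure shown here is only illustrative):

    {
        "readerOptions": {
            "delimiter": ";",
            "header": true,
            "encoding": "UTF-8"
        }
    }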

Separator value

Regarding the separator value in CSV files, "," is frequently used by default and, for the line break, "\r", "\r\n" and "\n". Some examples are shown below.

Case 1
  • Content file:
    Name;Code
    "Congo;
        the Democratic Republic of the";CD
    Réunion;RE
    
  • readerOptions definition:
    {
        "delimiter": ";", 
        "encoding": "UTF-8", 
        "header": true, 
        "multiLine": true
    }
    
Case 2
  • Content file:
    Name;Code¦"Congo;
        the Democratic Republic of the";CD¦Réunion;RE
    
  • readerOptions definition:
    {
        "delimiter": ";", 
        "encoding": "UTF-8", 
        "header": true, 
        "multiLine": true,
        "lineSep":"¦"
    }
    

Versions

From version 2022.R3, the Serde, SerdeProperties, RowDelimiter, FieldDelimiter and Encoding columns are configurable as fields within the readerOptions field of the AdditionalProperties column.

Serde value

Here, the source format of the file is specified. The value is often "csv", but it can be any of the values allowed by the spark.read.format() function, which, as of today, are: json, parquet, jdbc, orc, libsvm, csv and text.

SerdeProperties value

Depending on which source format the system ingests, the SerdeProperties may vary. To provide more context about this column, its content is passed to the options function in Databricks. For instance, using the following configuration: Serde = 'csv', SerdeProperties = '"separatorChar" = "\u0001", "quoteChar" = "\u0003", "escapeChar" = "\u0002"', the outcome in Databricks Spark will be something like: read.format('csv').options("separatorChar" = "\u0001", "quoteChar" = "\u0003", "escapeChar" = "\u0002").
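
From version 2022.R3, an equivalent configuration can be declared within readerOptions instead of the Serde and SerdeProperties columns (see the Versions note above). The following is a hypothetical sketch that assumes the Spark CSV option names delimiter, quote and escape as counterparts of separatorChar, quoteChar and escapeChar:

    {
        "readerOptions": {
            "delimiter": "\u0001",
            "quote": "\u0003",
            "escape": "\u0002"
        }
    }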

2. consolidationMode field

It can take the values Merge and Snapshot. When selecting:

  • Merge, the data will be merged.
  • Snapshot (default value), the data will be appended.
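
For instance, to merge the incoming data instead of appending it, the field could be declared as follows (illustrative sketch):

    {
        "consolidationMode": "Merge"
    }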

3. dataPreview field

The values for this option are true and false. When true, the data preview for the Data Catalog in the Sidra Website is generated.
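
For instance, an illustrative sketch enabling the data preview generation:

    {
        "dataPreview": true
    }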

4. piiDetectionEnabled field

This field allows enabling PII detection in the Assets of the Entity as part of the data load.

More information about PII detection can be checked here.

5. language field

Indicates the language of the data. The values for this option are: en and es. This option is used for PII detection.
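
As an illustrative sketch, both PII-related fields could be combined to enable PII detection over English data:

    {
        "piiDetectionEnabled": true,
        "language": "en"
    }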

6. isEncrypted field

The default value of this field is false.

Versions

From 2022.R3 onwards, this field will not be used.

7. Database origin fields

These fields can change depending on the database or origin type, as follows (an illustrative SQL Server sketch is included at the end of this section):

  • sourceTableName: Fully qualified name of the origin table, in the specific origin format. For example, the formats used for qualified table names are:
    • SQL Server: [TABLE_CATALOG].[TABLE_SCHEMA].[TABLE_NAME]
    • DB2: "TABLE_SCHEMA"."TABLE_NAME"
    • MYSQL: TABLE_SCHEMA.TABLE_NAME
  • sourceServer: Name of the source SQL server.
  • sourceDatabase: Name of the source database.
  • sourceSchema: Name of the origin schema.
  • sourceTable: Name of the origin table.
  • GoogleAnalytics. For example:

    {
    "ViewId" : "271455630",
    "RequestData" :  {
            "dimensions": [
                {
                    "name": "ga:date"
                },
                {
                    "name": "ga:dimension3"
                },
                {
                    "name": "ga:dimension2"
                },
                {
                    "name": "ga:pagePath"
                }
            ],
            "metrics": [
                {
                    "expression": "ga:hits"
                }
            ]
        }
    }
    

    Where:

    • ViewId: Google Analytics View ID in the property.
    • RequestData: Google Analytics dimensions/metrics request. It will be used for creating the metadata and also by the data intake pipeline.
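
As an illustrative sketch for a SQL Server origin (all names and values below are hypothetical examples):

    {
        "sourceTableName": "[AdventureWorks].[Sales].[Customer]",
        "sourceServer": "sql-server-01",
        "sourceDatabase": "AdventureWorks",
        "sourceSchema": "Sales",
        "sourceTable": "Customer"
    }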

8. TransferQuery field

When defined, it indicates the notebook path used for loading the data into the DSU. If it is not declared, the default notebook, the DSU ingestion one, will be used. Please refer to the use cases below.

9. anomalyDetection field

When true, anomaly detection is activated.
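
For instance, an illustrative sketch activating it:

    {
        "anomalyDetection": true
    }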

10. entitySchema field

This field is used for the ingestion of Excel files that are not structured in a tabular format. It indicates how to read the Excel file, selecting the correct cells or range of cells to ingest.

More information can be checked in the specific page for Excel file ingestion.

11. SearchServices field

For example:

{
    "SearchServices": {
        "FilePathColumn": "Path",
        "ProjectionsColumn": "Projections",
        "IndexNameColumn": "IndexName",
        "FilterExpression": "Name eq 'Factura.xlsx'",
        "OffsetToConvertUTCToSharePointTimeZone": "1"
    }
}
Where:

  • FilePathColumn, ProjectionsColumn, and IndexNameColumn define Databricks table column mappings.
  • FilterExpression and OffsetToConvertUTCToSharePointTimeZone are specific to the SharePoint connector (e.g. FilterExpression is used to filter files coming from selected SharePoint folders).

12. useLegacyTransferQuery field

If this field is set to true, the legacy Transfer Query will be used for data loads that use the RegisterAsset and FileIngestionDatabricks pipelines (Data Intake Process via landing zone). If it is false or not declared, the new DSU Ingestion script will be used.

Note that if the TransferQuery field is defined, the notebook indicated by that path will be used.

Use cases for TransferQuery and useLegacyTransferQuery fields

  • For complex Excel Data Intake:

    {
        "TransferQuery": "Excel/ExcelDataIntake"
    }

  • For using a legacy Transfer Query, note that it must already exist:

    {
        "UseLegacyTransferQuery": "true"
    }

  • For using a custom ingestion script, take into account that the path must be an absolute path:

    {
        "TransferQuery": "/folder/{ProviderName}folder/{TableName}"
    }




Last update: 2022-11-15