This article describes the different configuration options for the
AdditionalProperties column in the Entity table.
Some of the options of this column are detailed below.
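As a hedged illustration, an AdditionalProperties value combining several of the fields described below might look like the following. The field names come from this article; the overall structure and the values are illustrative only and may vary between Sidra versions.

```python
import json

# Hypothetical AdditionalProperties value combining fields described in this
# article (illustrative structure and values, not a definitive reference).
additional_properties = json.dumps({
    "readerOptions": '{"header": "true"}',  # itself a JSON defined as a string
    "consolidationMode": "Snapshot",
    "dataPreview": True,
    "piiDetectionEnabled": True,
    "language": "es",
})

# The column stores the value as text, so it round-trips through json.loads.
parsed = json.loads(additional_properties)
```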
1. readerOptions field
This field is used to read the source/raw file. In the case of database sources, the process generates a parquet file. This property is optional and its format is a JSON defined as a string. Its structure and its default values can vary depending on the type of file to read (.parquet, .csv, Excel (.xlsm or .xlsx), Excel (.xls)):
For .csv, .xlsx, and .xlsm files, the options follow the Apache Spark documentation; declaring, for example, a "header" parameter would look like:
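A minimal sketch of a readerOptions value for a .csv file, assuming the standard Spark CSV reader option names ("header", "delimiter", "encoding"); the exact defaults applied by Sidra may differ:

```python
import json

# Hypothetical readerOptions value for a .csv file; option names are the
# standard Spark CSV reader options, values are illustrative.
reader_options = '{"header": "true", "delimiter": ";", "encoding": "UTF-8"}'

options = json.loads(reader_options)
# In Databricks this could then be applied as (not executed here):
#   spark.read.format("csv").options(**options).load(path)
```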
For .xls files, according to the Pandas documentation:
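For .xls files, a hedged sketch of how readerOptions could map onto pandas keyword arguments; `sheet_name`, `header` and `skiprows` are real `pandas.read_excel` parameters, but the mapping itself is an assumption for illustration:

```python
import json

# Hypothetical readerOptions for an .xls file, mapped onto pandas.read_excel
# keyword arguments (sheet_name, header, skiprows are real pandas parameters;
# the values here are illustrative).
reader_options = '{"sheet_name": "Sheet1", "header": 0, "skiprows": 2}'

kwargs = json.loads(reader_options)
# The ingestion process might then translate this to (not executed here):
#   pandas.read_excel(path, **kwargs)
```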
For reading .parquet files, the Spark reader is used, according to the contemplated options.
Regarding the separator value in .csv files, "," is frequently used by default and, for the line break, "\r", "\r\n" and "\n".
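To illustrate the separator and line-break options, here is a hypothetical file content with a matching readerOptions value; "delimiter" and "lineSep" are standard Spark CSV option names, and the file content is invented for the example:

```python
import json

# Hypothetical content of a .csv file using ";" as separator and "\n" breaks.
content = 'id;name\n1;alpha\n2;beta\n'

# Matching readerOptions ("delimiter" and "lineSep" are Spark CSV options).
reader_options = json.loads('{"header": "true", "delimiter": ";", "lineSep": "\\n"}')

# Manually splitting the content the way the configured reader would.
rows = [line.split(reader_options["delimiter"])
        for line in content.splitlines()]
```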
From version 2022.R3, the Serde, SerdeProperties, RowDelimiter, FieldDelimiter and Encoding columns will be configurable as fields within the readerOptions field of the AdditionalProperties column.
The Serde field specifies the source format of the file. The value is often "csv", but it can be any of the values allowed by the spark.read.format() function, which, as of today, are: json, parquet, jdbc, orc, libsvm, csv, text.
Depending on which source format the system ingests, the SerdeProperties may vary. To provide more context about this column, its content will be passed to the options function in Databricks. For instance, using the following configuration: Serde = 'csv', SerdeProperties = '"separatorChar" = "\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002"', the outcome in Databricks Spark will be something like:
spark.read.format('csv').options(separatorChar="\u0001", quoteChar="\u0003", escapeChar="\u0002")
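The translation from the SerdeProperties string into reader options can be sketched as follows. This parsing logic is purely illustrative (it is not Sidra's actual implementation), but it shows how the `"key" = "value"` pairs end up as a dictionary suitable for `.options(**...)`:

```python
import re

# SerdeProperties as stored in the column (example from this article;
# the \u sequences are literal text in the stored string).
serde_properties = '"separatorChar" = "\\u0001","quoteChar"= "\\u0003","escapeChar"= "\\u0002"'

# Illustrative parsing of "key" = "value" pairs (not Sidra's real code).
pairs = re.findall(r'"([^"]+)"\s*=\s*"((?:\\u[0-9a-fA-F]{4}|[^"])*)"',
                   serde_properties)
options = {k: v.encode().decode("unicode_escape") for k, v in pairs}

# In Databricks this would then be applied as (not executed here):
#   spark.read.format("csv").options(**options).load(path)
```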
2. consolidationMode field
It can take the following values, when selecting:
- Merge: the data will be merged.
- Snapshot (default value): the data will be appended.
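As a small illustration (the field name comes from this article; the fragment itself is hypothetical), selecting the Merge mode could look like:

```python
import json

# Hypothetical AdditionalProperties fragment selecting the Merge mode; with
# the default "Snapshot" the data would be appended instead of merged.
consolidation_example = json.loads('{"consolidationMode": "Merge"}')
```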
3. dataPreview field
The values for this option are:
- true: generates the data preview for the Data Catalog in Sidra Website.
- false: no data preview is generated.
4. piiDetectionEnabled field
This field enables PII detection in the Assets of the Entity as part of the data load.
More information about PII detection can be checked here.
5. language field
Indicates the language of the data; this option is used for PII detection. For example:
- es (Spanish).
6. isEncrypted field
This field indicates whether the content is encrypted. From 2022.R3 onwards, this field is no longer used.
7. Origin type fields
These fields can change depending on the database or origin type, as follows. For database origins:
sourceTableName: Name of the complete origin table with the specific origin format. For example, format used for qualified table names:
- SQL Server: [TABLE_CATALOG].[TABLE_SCHEMA].[TABLE_NAME]
- DB2: "TABLE_SCHEMA"."TABLE_NAME"
sourceServer: Name of the source SQL server.
sourceDatabase: Name of the source database.
sourceSchema: Name of the origin schema.
sourceTable: Name of the origin table.
For GoogleAnalytics origins, for example:
viewId: Google Analytics View ID in the property.
requestData: Google Analytics dimensions/metrics request. It will be used for creating the metadata and also by the data intake pipeline.
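The origin fields above can be sketched as hypothetical AdditionalProperties fragments. Field names are taken from this article; all server names, identifiers and the requestData structure are placeholders for illustration:

```python
import json

# Hypothetical fragment for a SQL Server origin (values are placeholders).
sql_server_origin = json.loads("""{
    "sourceTableName": "[MyCatalog].[dbo].[Customers]",
    "sourceServer": "myserver.example.com",
    "sourceDatabase": "MyCatalog",
    "sourceSchema": "dbo",
    "sourceTable": "Customers"
}""")

# Hypothetical fragment for a Google Analytics origin; the requestData
# structure shown here is an assumption, not the documented schema.
google_analytics_origin = json.loads(
    '{"viewId": "123456789",'
    ' "requestData": {"dimensions": ["ga:date"], "metrics": ["ga:sessions"]}}'
)
```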
8. TransferQuery field
When defined, it indicates the notebook path used for loading the data into the DSU. If it is not declared, the default notebook, the DSU ingestion one, will be used. Please refer to the use cases below.
9. anomalyDetection field
If set to true, the anomaly detection is activated.
10. entitySchema field
This field is used for the ingestion of Excel files that are not structured in a tabular format. It indicates how to read the Excel file, selecting the correct cells to ingest, or a range of them.
More information can be checked in the specific page for Excel file ingestion.
11. SearchServices field
Some fields, like IndexNameColumn, define Databricks table column mappings. Others, like OffsetToConvertUTCToSharePointTimeZone, are specific to the SharePoint connector (e.g. FilterExpression is used to filter files coming from selected SharePoint folders).
12. useLegacyTransferQuery field
If this field is set to true, the legacy Transfer Query will be used for the data loads that use FileIngestionDatabricks pipelines (Data Intake Process via landing zone). If false or not declared, the new DSU Ingestion script will be used.
Note that if the TransferQuery field is defined, the notebook indicated by the corresponding path will be used instead.
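The precedence between the two fields can be sketched as follows. This is illustrative pseudologic, not Sidra's actual code, and the notebook path strings are placeholders:

```python
def resolve_ingestion_notebook(additional_properties: dict) -> str:
    """Illustrative precedence between TransferQuery and useLegacyTransferQuery.

    The returned placeholder strings stand in for real Sidra notebook paths.
    """
    transfer_query = additional_properties.get("TransferQuery")
    if transfer_query:
        # An explicitly defined notebook path always wins.
        return transfer_query
    if additional_properties.get("useLegacyTransferQuery"):
        return "<legacy Transfer Query notebook>"
    return "<default DSU ingestion script>"
```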
Use cases for TransferQuery and useLegacyTransferQuery fields
- For complex Excel Data Intake:
- For using a legacy Transfer Query, note that the notebook must already exist:
- For using a custom ingestion script, take into account that the path must be an absolute path:
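The three use cases above could correspond to AdditionalProperties fragments like the following. The field names come from this article; the entitySchema structure and the notebook path are invented placeholders:

```python
import json

# Complex Excel Data Intake: entitySchema drives how the Excel file is read
# (the structure shown here is a placeholder, not the documented schema).
complex_excel = json.loads('{"entitySchema": {"sheets": [{"name": "Sheet1"}]}}')

# Legacy Transfer Query (the legacy notebook must already exist).
legacy = json.loads('{"useLegacyTransferQuery": true}')

# Custom ingestion script, referenced by an absolute (placeholder) path.
custom = json.loads('{"TransferQuery": "/Shared/MyCustomIngestionNotebook"}')
```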