
Attribute table

An Attribute defines the metadata information of a column of the content, for example the type of the values in the column, the order with respect to other columns of the same Entity, the name, etc. It also contains information about how the content will be ingested into the Data Lake. Each Attribute belongs to an Entity.

Attribute contains data about the schema of the files. This information is used to identify columns in a file, create Databricks tables and, when necessary, create the SQL tables into which the data is extracted from the Data Lake (this is for Client Applications). Most of the metadata about the columns is used to validate fields.

| Column | Description | Format | Required |
|---|---|---|---|
| Id | Attribute identifier | int | |
| IdEntity | The identifier of the related Entity | int | |
| Name | Name of the column. The space character is not allowed. From version 2022.R2 (1.12.2 update), the metadata API will sanitize spaces provided in Entity or Attribute names by replacing them with the `_` character (only when created through the API) | varchar(150) | |
| HiveType | The type of the column. It will be used to generate the Hive table | varchar(30) | |
| MaxLen | Maximum size of the column. This applies to columns of STRING type. A validation error will be raised if the value in the file content is longer than this number of characters | int | |
| IsNullable | Tells if the column can have null values or not. A validation error will be raised if the field is filled with NULL and it is not nullable | bit | |
| NeedTrim | Tells if the column needs to be trimmed, removing leading and trailing spaces from the value | bit | |
| ReplacedText | Indicates a text in the column to be replaced by the ReplacementText value | varchar(5) | |
| ReplacementText | The text to replace the ReplacedText with | varchar(5) | |
| SpecialFormat | If the column needs to pass a validation that cannot be enforced using the rest of the columns, it is specified here, for example to validate a date format. If the column is calculated, this field can describe how to calculate the column value | varchar(512) | |
| TreatEmptyAsNull | Tells if empty values should be considered nulls | bit | |
| IsPrimaryKey | Used when IsMerged in the Entity is true, to determine which columns form the key used to match existing and new data. For plain text files, it is quite often set to false for all the Attributes | bit | |
| Order | The order of the column inside the file | smallint | |
| IsCalculated | Tells if the column is calculated instead of being a column inside the file. The SpecialFormat column describes how to obtain the column value. If SpecialFormat is not defined, an error will be displayed | bit | |
| IsPartitionColumn | Tells if the column is used in the table partition or not. True applies to the fields used as partitions when storing the data in the DSU | bit | |
| IsMetadata | Tells if the column is metadata for the system and not data of the file itself | bit | |
| SQLType | Indicates the type of the column used to create a SQL table. Example values: "INT", "DATETIME2", "BIT", "DATE", "DECIMAL" | varchar(30) | |
| ValidationText | Defines specific validations to be performed for this data instead of the validations generated by the GenerateTransferQuery custom activity. If the value does not match, a validation error will be displayed. When a value for ValidationText is provided, any other validation is overridden and the ValidationText is the only one applied | varchar(512) | |
| Description | Description of this Attribute | nvarchar(max) | |
| SourceName | Name of the column in the data source. It is used in case it is different from the name used in Sidra | varchar(150) | |
| ParentSecurityPath | The SecurityPath of the Entity to which the Attribute belongs | varchar(100) | |
| IsEncrypted | Can be used to encrypt the values of an Attribute in the Data Lake | bit | |
| ItemId | GUID of the Attribute | uniqueidentifier | |
| SourceType | Represents the type that the Attribute has in the source system (DB type in the case of databases) | | |
| DataMask | Indicates the data masking rule that will be applied in the preview of this field in the Data Catalog | varchar(200) | |
| SecurityPath | Path of identifiers used for authorization | varchar(50) | |
| IdDataStorageUnit | Identifier of the Data Storage Unit in which the Entity is contained | int | |
| IdProvider | Identifier of the Provider that generates this type of file | int | |

TreatEmptyAsNull column

If it is true and the origin value is empty, then the final result is NULL. Whether that NULL is then accepted depends on whether the column allows NULL values (IsNullable).

For CSV files, reading an empty value yields a NULL value by default, regardless of how the TreatEmptyAsNull field is configured.
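The interplay between NeedTrim, TreatEmptyAsNull and IsNullable can be sketched as follows. This is an illustrative Python sketch of the validation semantics described above, not Sidra's actual ingestion code; the function name and parameters are hypothetical.

```python
def normalize_value(raw, treat_empty_as_null=False, is_nullable=True, need_trim=False):
    """Sketch of per-field normalization driven by Attribute metadata."""
    value = raw
    if need_trim and isinstance(value, str):
        value = value.strip()          # NeedTrim: drop leading/trailing spaces
    if treat_empty_as_null and value == "":
        value = None                   # TreatEmptyAsNull: empty string becomes NULL
    if value is None and not is_nullable:
        # IsNullable: NULL in a non-nullable column is a validation error
        raise ValueError("validation error: NULL value in a non-nullable column")
    return value
```

For example, `normalize_value("  abc ", need_trim=True)` returns `"abc"`, while `normalize_value("", treat_empty_as_null=True, is_nullable=False)` raises a validation error.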

DataMask column

DataMask relies on the Microsoft SQL Server implementation of dynamic data masking. For the available DataMask functions, please refer to the list of SQL Server masking functions in the Microsoft documentation. An example value for the DataMask field is partial(1,"XXXXXXX",0), which means that a custom data masking function will be applied where only the first character of the string is left unmasked.
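SQL Server's partial() masking function keeps a fixed number of leading and trailing characters and replaces the middle with a padding string. A minimal Python sketch of those semantics (illustrative only; the real masking is applied server-side by SQL Server):

```python
def partial_mask(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Emulate partial(prefix, padding, suffix): keep `prefix` leading and
    `suffix` trailing characters, replace the rest with `padding`."""
    head = value[:prefix]
    tail = value[-suffix:] if suffix > 0 else ""
    return head + padding + tail
```

With prefix 1, padding "XXXXXXX" and suffix 0, a value such as "Sidra" would be shown as "SXXXXXXX": only the first character survives.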

IsEncrypted column

Attributes from an existing Entity can be encrypted. This means that, when loading data into the Databricks table, the legacy autogenerated transfer query / DSU ingestion script will encrypt all Attributes whose IsEncrypted column is set to true. When querying that table, encrypted columns will show a random string instead of the original value.


In addition, the Assets of an Entity can be encrypted as well, so anyone who has access to them will not be able to read the contents unless they are previously decrypted. To enable this, it is necessary to update the AdditionalProperties field of the Entity table, adding the following property to the JSON:

{ "isEncrypted": "true" }

Apart from that, it is necessary to insert, in the EntityPipeline table, a relationship between the Entity whose Assets are going to be encrypted and the pipeline FileIngestionWithEncryption.

Lastly, to enable any kind of encryption, it is necessary to generate two parameters that must be present in the deployment library:

  • EncryptionKey: a randomly generated string 16, 24 or 32 bytes long.
  • EncryptionInitializationVector: a randomly generated string 16 bytes long.
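These two parameters can be generated with a short script. A minimal Python sketch using the standard library (the helper names are illustrative; any cryptographically secure random generator producing the required lengths works):

```python
import secrets
import string

_ALPHABET = string.ascii_letters + string.digits

def random_string(length: int) -> str:
    """Cryptographically secure random ASCII string of `length` bytes."""
    return "".join(secrets.choice(_ALPHABET) for _ in range(length))

def generate_encryption_parameters(key_length: int = 32) -> dict:
    """Generate the two deployment parameters described above."""
    if key_length not in (16, 24, 32):
        raise ValueError("EncryptionKey must be 16, 24 or 32 bytes long")
    return {
        "EncryptionKey": random_string(key_length),
        "EncryptionInitializationVector": random_string(16),
    }
```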

SQLType column

Many Client Applications end up storing data extracted from the Data Lake in an Azure SQL Database. In order to auto-generate the tables that hold that data, this column contains a specific SQL Server type, such as DECIMAL with scale and precision, or DATETIME2.
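As a hypothetical illustration of how such a table could be auto-generated from Attribute metadata (the attribute list and function are illustrative, not Sidra's actual code):

```python
# Illustrative Attribute rows; field names mirror the Attribute table columns.
attributes = [
    {"Name": "Id",       "SQLType": "INT",           "IsNullable": False},
    {"Name": "Amount",   "SQLType": "DECIMAL(18,2)", "IsNullable": True},
    {"Name": "LoadDate", "SQLType": "DATETIME2",     "IsNullable": False},
]

def build_create_table(table_name, attributes):
    """Build a CREATE TABLE statement from SQLType and IsNullable metadata."""
    columns = ", ".join(
        "[{0}] {1} {2}".format(
            a["Name"], a["SQLType"], "NULL" if a["IsNullable"] else "NOT NULL")
        for a in attributes
    )
    return "CREATE TABLE [{0}] ({1});".format(table_name, columns)
```

For the sample metadata above, build_create_table("Invoices", attributes) produces a statement containing "[Amount] DECIMAL(18,2) NULL".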

IsPartitionColumn column

As described in previous sections, data is stored in the Data Lake in an optimized way. One of these optimizations is partitioning, which makes extraction more accurate and faster. Setting this column to true will make the table created in the Data Lake be partitioned by this column. Note that, if two columns act as partition columns, their order impacts how the table is partitioned: for example, with FileDate first and IdSourceItem second, the table is partitioned by FileDate and then by IdSourceItem. It is important to make sure that the columns are in the correct order.
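The effect of partition-column order can be illustrated with a Hive-style partition path. This is a sketch of the general partitioning concept, not Sidra's storage layout:

```python
def partition_path(partition_columns, row):
    """Nest partition directories in the order the columns are declared."""
    return "/".join("{0}={1}".format(col, row[col]) for col in partition_columns)

row = {"FileDate": "2022-11-02", "IdSourceItem": 7}
```

With ["FileDate", "IdSourceItem"] the row lands under FileDate=2022-11-02/IdSourceItem=7; reversing the order nests IdSourceItem directories first, which changes which queries can prune partitions efficiently.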


IsPrimaryKey column

Only used for "merged" Entities. For plain text files, it is quite often set to false for all the Attributes.


ValidationText column

When a value for ValidationText is provided, any other validation is overridden and the ValidationText is the only one applied.


From Sidra version 2022.R3 onwards, the RemoveQuotes field has been deprecated.

Sidra system Attributes

There are five Attributes that are added to the original set of specific Attributes of the Entity. They are used by Sidra for multiple purposes, and it is mandatory to have them. Those Attributes are: SourceByteOffset, LoadDate, HasErrors, FileDate and IdSourceItem.

By default, Sidra partitions the Data Lake tables by FileDate and IdSourceItem, but it is up to the team configuring the system to choose which partitions to use. The Sidra team strongly encourages using these two Attributes or, at least, IdSourceItem, which is widely used by the platform and has a positive effect on its performance.


Last update: 2022-11-02