Entity table¶
This table describes the information about an Entity that must be included when a new Entity is stored in the metadata database:
Column | Description | Format | Required |
---|---|---|---|
Id | Entity identifier | int | Yes |
IdProvider | Identifier of the Provider where this Entity will be associated to | int | Yes |
Name | Name of the Entity. The space character is not allowed. If spaces are provided in Entity or Attribute names, the metadata API will sanitize them by replacing each space with the _ character (only when created through the API) | varchar(256) | Yes |
TableName | The name of the table created in Databricks to store the information of the Assets associated to this Entity. The space ' ' character must not be used; the Java libraries in Databricks reject table names containing any of the characters " ,;{}()\n\t" | varchar(256) | Yes |
RegularExpression | Used to identify when a file will be associated to the Entity. If the filename matches the regular expression of the Entity, an Asset is generated for that file and associated to the Entity. For example: ^myentity_((?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2}))_((?<hour>\\d{2})-(?<minute>\\d{2})-(?<second>\\d{2})).csv | nvarchar(500) | |
StartValidDate | Date from which this Entity applies to the Asset. It could happen that, at any time, an Asset starts presenting an extra column because there has been a schema change in the data source. In order to preserve all the data previously ingested into the system, two Entities related to the same type of Asset, but generated at two different moments, can be registered | datetime2(0) | Yes |
EndValidDate | Date until which this Entity applies to the file | datetime2(7) | |
Serde | Specifies the source format of the file | varchar(100) | |
SerdeProperties | See SerdeProperties value documentation | varchar(100) | |
Encoding | See examples on the documentation site | varchar(50) | |
HeaderLines | Specifies the number of header lines of the Assets related to this Entity | smallint | Yes |
FieldDelimiter | See examples on the documentation site | varchar(6) | |
LastUpdated | Date on which the Asset metadata was updated and needs to be deployed to have the latest changes synchronized with the rest of the system | datetime2(0) | Yes |
LastDeployed | Date on which the Asset metadata was deployed and synchronized with the rest of the system | datetime2(0) | |
Format | The format of the file. The currently supported values are xlsx , xlsm , parquet and csv | varchar(10) | |
NullText | If there is a special string used as a null value, it can be specified here | varchar(10) | |
ReCreateTableOnDeployment | Indicates whether the table in the Data Lake must be created or recreated. If the table already exists, it will be overwritten. Once the table has been created, this value changes to 0 | bit | Yes |
RowDelimiter | See examples on the documentation site | varchar(6) | |
FilesPerDrop | Sometimes one type of file can be generated from different sources. In that case, this parameter indicates the number of files of the Entity needed to consider the data for a day complete | int | Yes |
SourcePath | Indicates the origin of the generated files. For example, in case of an SFTP server, this tells the path of the file inside it | nvarchar(max) | |
Description | Description of this Entity | nvarchar(max) | |
AdditionalProperties | JSON structure with additional properties used for extensibility, including the options: readerOptions , consolidationMode and dataPreview | varchar(max) | |
IdTableFormat | Identifier of the format of the table where this Entity is stored (ORC, Parquet and Delta are supported) | int | |
GenerateDeltaTable | When true , an additional table is created where the changes produced by each new Asset are traced | bit | Yes |
ParentSecurityPath | The SecurityPath of the Provider to which the Entity belongs | varchar(100) | Yes |
Image | Image used for the Entity in the Sidra Manager UI | varbinary(max) | |
CreationDate | Date of the creation | datetime2(7) | Yes |
Detail | Markdown with Entity details | nvarchar(max) | |
ViewDefinition | Used when the Entity is a view instead of a table in the DSU | nvarchar(max) | |
ItemId | GUID of the Entity | uniqueidentifier | Yes |
SecurityPath | Path of identifiers used for authorization | varchar(113) | Yes |
IdDataStorageUnit | Identifier of the Data Storage Unit in which the Entity is contained | int | |
IdDataIntakeProcess | Identifier of the related Data Intake Process (DIP) | int | |
For more information about AdditionalProperties configuration or GenerateDeltaTable configuration, please check their corresponding pages.
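The sample pattern shown for the RegularExpression column can be exercised directly. A minimal sketch in Python, assuming the stored .NET-style named groups `(?<name>...)` are translated to Python's `(?P<name>...)` syntax; the filename used is illustrative:

```python
import re

# Python translation of the sample pattern from the RegularExpression column.
# The stored pattern uses .NET-style named groups (?<name>...); Python needs (?P<name>...).
pattern = re.compile(
    r"^myentity_"
    r"((?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2}))_"
    r"((?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2}))\.csv"
)

# A filename matching the pattern yields an Asset; the named groups expose the date parts.
match = pattern.match("myentity_2023-05-01_08-30-00.csv")
if match:
    print(match.group("year"), match.group("month"), match.group("day"))  # → 2023 05 01
```

A filename that does not match the Entity's regular expression simply produces no Asset for that Entity.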
Versions¶
From version 2022.R3, the Serde, SerdeProperties, RowDelimiter, FieldDelimiter and Encoding columns can be configured as fields within the readerOptions field of the AdditionalProperties column. These columns remain in the table for backward compatibility with previous versions.
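As a rough sketch of what such a configuration could look like, the snippet below builds an AdditionalProperties payload in Python. The top-level field names (readerOptions, consolidationMode) come from this document; the individual option keys and values inside readerOptions are illustrative assumptions, not an exhaustive or authoritative schema:

```python
import json

# Hypothetical AdditionalProperties payload. Only the top-level field names
# (readerOptions, consolidationMode) are taken from the documentation; the
# option keys and values below are assumptions for illustration only.
additional_properties = {
    "readerOptions": {
        "header": True,       # whether the file has a header line
        "delimiter": ",",     # plays the role of the legacy FieldDelimiter column
        "encoding": "UTF-8",  # plays the role of the legacy Encoding column
    },
    "consolidationMode": "Merge",  # assumed value, for illustration only
}

# Serialized form, as it would be stored in the varchar(max) AdditionalProperties column.
print(json.dumps(additional_properties))
```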
HeaderLines column¶
With this column it is possible to define the number of lines used as header. There are some considerations to keep in mind:

- `readerOptions` values have priority in the case that both (`HeaderLines` and `readerOptions`) are declared.
- Only in the case that `readerOptions` indicates that there is a header and `HeaderLines` has a value bigger than 1, the content of the load will exclude the number of header lines indicated by the `HeaderLines` column.
Examples:

- Case 1. `readerOptions`: `{"header": false}` and `HeaderLines = 1`. `readerOptions` has priority and there is no header.
- Case 2. `readerOptions`: `{"header": true}` / `HeaderLines = 0`. `readerOptions` has priority and, by default, there is one header line.
- Case 3. `readerOptions`: `{"header": true}` / `HeaderLines = 3`. `readerOptions` has priority and the header consists of the first 3 lines, as specified.
- Case 4. `readerOptions`: `{"header": false}` / `HeaderLines = 3`. `readerOptions` has priority and there is no header.
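The precedence rules above can be sketched as a small Python helper. The function name and signature are hypothetical, written only to reproduce the four documented cases:

```python
def effective_header_lines(reader_options, header_lines):
    """Sketch of the documented precedence: readerOptions wins when declared.

    Returns the number of lines to skip as header. Hypothetical helper,
    not part of any Sidra API.
    """
    if reader_options is not None and "header" in reader_options:
        if not reader_options["header"]:
            return 0                      # Cases 1 and 4: no header at all
        if header_lines > 1:
            return header_lines           # Case 3: header spans HeaderLines lines
        return 1                          # Case 2: header defaults to one line
    return header_lines                   # readerOptions absent: HeaderLines applies

# The four cases from the examples above:
print(effective_header_lines({"header": False}, 1))  # Case 1 → 0
print(effective_header_lines({"header": True}, 0))   # Case 2 → 1
print(effective_header_lines({"header": True}, 3))   # Case 3 → 3
print(effective_header_lines({"header": False}, 3))  # Case 4 → 0
```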
RegularExpression column¶
Sometimes this field must be agreed upon with other actors. For instance, if the system is going to load files from an FTP server where a third party drops the files, the DEV team has to agree with the other team on a naming convention so that the ingestion process can recognize the files via RegularExpression. This scenario applies to FTP/landing zone file drops. When pulling from data sources such as SQL databases, the DEV team is in charge of defining this field and no interaction with other teams is required.