Entity table¶
This table describes the information about an Entity that must be included when a new Entity is stored in the metadata database:
Column | Description | Format | Required |
---|---|---|---|
Id | Entity identifier | int | Yes |
IdProvider | Identifier of the Provider where this Entity will be associated to | int | Yes |
Name | Name of the Entity. The space character is not allowed. If spaces are provided in Entity or Attribute names, the metadata API will sanitize them by replacing each space with the _ character (only when created through the API) | varchar(256) | Yes |
TableName | The name of the table created in Databricks to store the information of the Assets associated to this Entity. The space ' ' character must not be used; the Java libraries in Databricks reject table names containing any of the characters " ,;{}()\n\t" | varchar(256) | Yes |
RegularExpression | Used to identify when a file will be associated to the Entity. If the filename matches the regular expression of the Entity, an Asset is generated for that file and associated to the Entity. For example: ^myentity_((?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2}))_((?<hour>\\d{2})-(?<minute>\\d{2})-(?<second>\\d{2})).csv | nvarchar(500) | |
StartValidDate | Date from which this Entity applies to the Asset. It could happen that, at any time, an Asset starts presenting an extra column because there has been a schema change in the data source. In order to preserve all the data previously ingested into the system, two Entities related to the same type of Asset, but generated at two different moments, can be registered | datetime2(0) | Yes |
EndValidDate | Date until which this Entity applies to the file | datetime2(7) | |
Serde | Specifies the source format of the file | varchar(100) | |
SerdeProperties | See SerdeProperties value documentation | varchar(100) | |
Encoding | See examples on the documentation site | varchar(50) | |
HeaderLines | Specifies the number of header lines of the Assets related to this Entity | smallint | Yes |
FieldDelimiter | See examples on the documentation site | varchar(6) | |
LastUpdated | Date on which the Asset metadata was updated and needs to be deployed to have the latest changes synchronized with the rest of the system | datetime2(0) | Yes |
LastDeployed | Date on which the Asset metadata was deployed and synchronized with the rest of the system | datetime2(0) | |
Format | The format of the file. The currently supported values are xlsx , xlsm , parquet and csv | varchar(10) | |
NullText | If there is a special string used as a null value, it can be specified here | varchar(10) | |
ReCreateTableOnDeployment | Indicates whether the table in the Data Lake must be created or recreated. If the table already exists, it will be overwritten. Once the table has been created, this value changes to 0 | bit | Yes |
RowDelimiter | See examples on the documentation site | varchar(6) | |
FilesPerDrop | Sometimes one type of file can be generated from different sources. In that case, this parameter indicates the number of files of the Entity needed to consider the data for a day complete | int | Yes |
SourcePath | Indicates the origin of the generated files. For example, in case of an SFTP server, this tells the path of the file inside it | nvarchar(max) | |
Description | Description of this Entity | nvarchar(max) | |
AdditionalProperties | JSON structure with additional properties used for extensibility, including the options: readerOptions , consolidationMode and dataPreview | varchar(max) | |
IdTableFormat | Identifier of the format of the table where this Entity is stored (ORC, Parquet and Delta are supported) | int | |
GenerateDeltaTable | When true , an additional table is created where the changes produced by each new Asset are traced | bit | Yes |
ParentSecurityPath | The SecurityPath of the Provider to which the Entity belongs | varchar(100) | Yes |
Image | Image used for the Entity in the Sidra Manager UI | varbinary(max) | |
CreationDate | Date of the creation | datetime2(7) | Yes |
Detail | Markdown with Entity details | nvarchar(max) | |
ViewDefinition | Used when the Entity is a view instead of a table in the DSU | nvarchar(max) | |
ItemId | GUID of the Entity | uniqueidentifier | Yes |
SecurityPath | Path of identifiers used for authorization | varchar(113) | Yes |
IdDataStorageUnit | Identifier of the Data Storage Unit in which the Entity is contained | int | |
IdDataIntakeProcess | Identifier of the related Data Intake Process (DIP) | int | |
For more information about AdditionalProperties configuration or GenerateDeltaTable configuration, please check their corresponding pages.
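The sample pattern shown for the RegularExpression column can be exercised directly. A minimal sketch in Python, assuming the stored .NET-style named groups `(?<name>...)` are translated to Python's `(?P<name>...)` syntax; the filename used is illustrative:

```python
import re

# Python translation of the sample pattern from the RegularExpression column.
# The stored pattern uses .NET-style named groups (?<name>...); Python needs (?P<name>...).
pattern = re.compile(
    r"^myentity_"
    r"((?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2}))_"
    r"((?P<hour>\d{2})-(?P<minute>\d{2})-(?P<second>\d{2}))\.csv"
)

# A filename matching the pattern yields an Asset; the named groups expose the date parts.
match = pattern.match("myentity_2023-05-01_08-30-00.csv")
if match:
    print(match.group("year"), match.group("month"), match.group("day"))  # → 2023 05 01
```

A filename that does not match the Entity's regular expression simply produces no Asset for that Entity.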
Versions¶
From version 2022.R3, the Serde, SerdeProperties, RowDelimiter, FieldDelimiter and Encoding columns can be configured as fields within the readerOptions field of the AdditionalProperties column. These columns remain in the table for backward compatibility with previous versions.
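As a rough sketch of what such a configuration could look like, the snippet below builds an AdditionalProperties payload in Python. The top-level field names (readerOptions, consolidationMode) come from this document; the individual option keys and values inside readerOptions are illustrative assumptions, not an exhaustive or authoritative schema:

```python
import json

# Hypothetical AdditionalProperties payload. Only the top-level field names
# (readerOptions, consolidationMode) are taken from the documentation; the
# option keys and values below are assumptions for illustration only.
additional_properties = {
    "readerOptions": {
        "header": True,       # whether the file has a header line
        "delimiter": ",",     # plays the role of the legacy FieldDelimiter column
        "encoding": "UTF-8",  # plays the role of the legacy Encoding column
    },
    "consolidationMode": "Merge",  # assumed value, for illustration only
}

# Serialized form, as it would be stored in the varchar(max) AdditionalProperties column.
print(json.dumps(additional_properties))
```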
HeaderLines column¶
With this column it is possible to define the number of lines used as header. There are some considerations to keep in mind:

- `readerOptions` values have priority in the case that both (`HeaderLines` and `readerOptions`) are declared.
- Only in the case that `readerOptions` indicates that there is a header and `HeaderLines` has a value bigger than 1, the content of the load will exclude the number of header lines indicated by the `HeaderLines` column.
Examples:

- Case 1. `readerOptions`: `{"header": false}` and `HeaderLines = 1`. `readerOptions` has priority and there is no header.
- Case 2. `readerOptions`: `{"header": true}` / `HeaderLines = 0`. `readerOptions` has priority and, by default, there is one header line.
- Case 3. `readerOptions`: `{"header": true}` / `HeaderLines = 3`. `readerOptions` has priority and the header consists of the first 3 lines, as specified.
- Case 4. `readerOptions`: `{"header": false}` / `HeaderLines = 3`. `readerOptions` has priority and there is no header.
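The precedence rules above can be sketched as a small Python helper. The function name and signature are hypothetical, written only to reproduce the four documented cases:

```python
def effective_header_lines(reader_options, header_lines):
    """Sketch of the documented precedence: readerOptions wins when declared.

    Returns the number of lines to skip as header. Hypothetical helper,
    not part of any Sidra API.
    """
    if reader_options is not None and "header" in reader_options:
        if not reader_options["header"]:
            return 0                      # Cases 1 and 4: no header at all
        if header_lines > 1:
            return header_lines           # Case 3: header spans HeaderLines lines
        return 1                          # Case 2: header defaults to one line
    return header_lines                   # readerOptions absent: HeaderLines applies

# The four cases from the examples above:
print(effective_header_lines({"header": False}, 1))  # Case 1 → 0
print(effective_header_lines({"header": True}, 0))   # Case 2 → 1
print(effective_header_lines({"header": True}, 3))   # Case 3 → 3
print(effective_header_lines({"header": False}, 3))  # Case 4 → 0
```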
RegularExpression column¶
Sometimes this field must be agreed upon with other actors. For instance, if the system is going to load files from an FTP server where a third party drops the files, the DEV team has to agree with the other team on a naming convention so that the ingestion process can recognize the files via RegularExpression. This scenario applies to FTP/landing zone file drops. When pulling from data sources such as SQL databases, the DEV team is in charge of defining this field and no interaction with other teams is required.