Entity table

The following information must be included for each new Entity stored in the Entity table of the metadata database:

| Column | Description | Format | Required |
|---|---|---|---|
| Id | Entity identifier | int | |
| IdProvider | Identifier of the Provider with which this Entity is associated | int | |
| Name | Name of the Entity. The space character is not allowed. As of the 2022.R2 1.12.2 update, the metadata API sanitizes spaces provided in Entity or Attribute names by replacing them with the _ character (only when created through the API) | varchar(256) | |
| TableName | The name of the table created in Databricks to store the information of the Assets associated with this Entity. The space ' ' character must not be used; the Java libraries in Databricks reject table names containing " ,;{}()\n\t" | varchar(256) | |
| RegularExpression | Used to identify when a file is associated with the Entity. If the filename matches the regular expression of the Entity, an Asset is generated for that file and associated with the Entity. For example: ^myentity_((?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2}))_((?<hour>\\d{2})-(?<minute>\\d{2})-(?<second>\\d{2})).csv | nvarchar(500) | |
| StartValidDate | Date from which this Entity applies to the Asset. An Asset may at some point start presenting an extra column because of a schema change in the data source. To preserve all the data already ingested in the system, two Entities can be registered for the same type of Asset, each valid for a different period | datetime2(0) | |
| EndValidDate | Date until which this Entity applies to the file | datetime2(7) | |
| HeaderLines | Number of header lines in the Assets related to this Entity | smallint | |
| LastUpdated | Date on which the Asset metadata was updated and needs to be deployed so that the latest changes are synchronized with the rest of the system | datetime2(0) | |
| LastDeployed | Date on which the Asset metadata was deployed and synchronized with the rest of the system | datetime2(0) | |
| Format | The format of the file. Currently supported formats are xlsx, xlsm, parquet and csv | varchar(10) | |
| NullText | If a special string is used as a null value, it can be specified here | varchar(10) | |
| ReCreateTableOnDeployment | Whether the table must be created or recreated in the Data Lake. If the table exists, it is overwritten. Once the table has been created, this value changes to 0 | bit | |
| FilesPerDrop | Sometimes one type of file can be generated from different sources. In this case, this parameter indicates the number of files of the Entity needed to consider the data for a day complete | int | |
| SourcePath | Indicates the origin of the generated files. For example, in the case of an SFTP server, this is the path of the file inside it | nvarchar(max) | |
| Description | Description of this Entity | nvarchar(max) | |
| AdditionalProperties | JSON structure with additional properties used for extensibility, including the options readerOptions, consolidationMode and dataPreview | varchar(max) | |
| IdTableFormat | Identifier of the format of the table where this Entity is stored (ORC, Parquet and Delta are supported) | int | |
| GenerateDeltaTable | When true, an additional table is created that traces the changes produced by each new Asset | bit | |
| ParentSecurityPath | The SecurityPath of the Provider to which the Entity belongs | varchar(100) | |
| Image | Image used for the Entity in the Sidra Manager UI | varbinary(max) | |
| CreationDate | Date of creation | datetime2(7) | |
| Detail | Markdown with Entity details | nvarchar(max) | |
| ViewDefinition | Used when the Entity is a view instead of a table in the DSU | nvarchar(max) | |
| ItemId | GUID of the Entity | uniqueidentifier | |
| SecurityPath | Path of identifiers used for authorization | varchar(113) | |
| IdDataStorageUnit | Identifier of the Data Storage Unit in which the Entity is contained | int | |

For more information about AdditionalProperties configuration or GenerateDeltaTable configuration, please check their corresponding pages.
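
The naming rules for the Name and TableName columns can be checked before an Entity is registered. The following is a minimal sketch in Python, not Sidra code: the function names sanitize_name and invalid_table_name_chars are hypothetical, and the only rules applied are the ones stated in the table (spaces replaced by _ in names, and the characters " ,;{}()\n\t" rejected in table names).

```python
# Characters the table description lists as rejected in Databricks table names.
INVALID_TABLE_NAME_CHARS = set(" ,;{}()\n\t")


def sanitize_name(name: str) -> str:
    """Replace spaces with '_' (hypothetical helper mirroring the described API behavior)."""
    return name.replace(" ", "_")


def invalid_table_name_chars(table_name: str) -> list:
    """Return the sorted distinct characters in table_name that would be rejected."""
    return sorted({c for c in table_name if c in INVALID_TABLE_NAME_CHARS})


if __name__ == "__main__":
    print(sanitize_name("my entity"))
    print(invalid_table_name_chars("my table;x"))
```

An empty list from invalid_table_name_chars means the table name contains none of the listed characters; any other validation performed by the platform is outside this sketch.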

HeaderLines column

This column defines the number of lines used as header. There are some considerations to keep in mind:

  1. readerOptions values have priority when both HeaderLines and readerOptions are declared.
  2. Only when readerOptions indicates that there is a header and HeaderLines has a value greater than 1 does the load skip the number of header lines given by the HeaderLines column.

Examples:

  • Case 1. readerOptions: {"header": false} / HeaderLines = 1. readerOptions has priority and there is no header.

  • Case 2. readerOptions: {"header": true} / HeaderLines = 0. readerOptions has priority and, by default, there is one header line.

  • Case 3. readerOptions: {"header": true} / HeaderLines = 3. readerOptions has priority and the first 3 lines are treated as the header, as specified.

  • Case 4. readerOptions: {"header": false} / HeaderLines = 3. readerOptions has priority and there is no header.
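
The four cases can be summarized as a small decision function. This is an illustrative Python sketch of the rules as described here, not the actual ingestion code; the function name effective_header_lines is hypothetical, and the behavior when readerOptions does not declare header at all is an assumption, since the cases do not cover it.

```python
def effective_header_lines(reader_options: dict, header_lines: int) -> int:
    """Return how many leading lines are treated as header, per the cases above."""
    header = reader_options.get("header")
    if header is None:
        # Assumption: with no "header" option declared, HeaderLines is used as-is.
        return header_lines
    if not header:
        # readerOptions has priority: no header, whatever HeaderLines says.
        return 0
    # A header is declared: one line by default, or HeaderLines when greater than 1.
    return header_lines if header_lines > 1 else 1


if __name__ == "__main__":
    for options, lines in [
        ({"header": False}, 1),
        ({"header": True}, 0),
        ({"header": True}, 3),
        ({"header": False}, 3),
    ]:
        print(options, lines, effective_header_lines(options, lines))
```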

RegularExpression column

Sometimes this field must be agreed with other parties. For instance, if the system is going to load files from an FTP server where a third party leaves them, the DEV team has to agree with that team on a naming convention so that the ingestion process can recognize the files via RegularExpression. This scenario applies to FTP/landing zone file drops. When pulling from data sources such as SQL databases, the DEV team is in charge of defining the expression and no interaction with other teams is required.
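
When agreeing on a naming convention, it helps to test candidate filenames against the expression beforehand. The sketch below uses Python to try the example expression from the Entity table. It rests on several assumptions: the doubled backslashes in the example are taken to be string escaping and are written as single backslashes; Python's re module spells named groups (?P<name>...), so the hypothetical helper to_python_pattern rewrites the (?<name>...) groups; and re.search is used, although how the ingestion process actually applies the expression is not stated here.

```python
import re

# Example expression from the Entity table, with "\\d" written as "\d" (assumption).
ENTITY_PATTERN = (
    r"^myentity_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))"
    r"_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2})).csv"
)


def to_python_pattern(pattern: str) -> str:
    """Rewrite (?<name>...) groups as (?P<name>...), leaving lookbehinds untouched."""
    return re.sub(r"\(\?<(?![=!])", "(?P<", pattern)


def match_entity(filename: str, pattern: str = ENTITY_PATTERN):
    """Return the named groups if filename matches the expression, else None."""
    match = re.search(to_python_pattern(pattern), filename)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(match_entity("myentity_2022-11-15_10-30-00.csv"))
    print(match_entity("other_2022-11-15_10-30-00.csv"))
```

A filename that follows the agreed convention yields the date and time parts as named groups; any other filename yields None and would not be associated with the Entity under this check.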


Last update: 2022-11-15