Skip to content

How to add a new Entity

This tutorial explains how to to add a new Entity in Sidra.

IMPORTANT

Using the Sidra API to create a new Entity is preferred, since the API will perform the needed validation checks, redusing the chances for errors.

This is required for example when a new type of file has to be loaded from a FTP server or new Entities are going to be pulled from an API. When adding a new Entity, there are two possible scenarios:

  • Scenario 1: the Entity belongs to an existing Provider.
  • Scenario 2: A new Provider must be configured. If the Provider is not created yet, this tutorial How to add a new Provider can be used to create it. Entities are stored in the Entity table in the Sidra Core metadata database.

There are two ways to create a new Entity in Sidra:

  • Create a new Entity by creating an SQL script to insert the new Entity in the database.
  • Create a new Entity by using the Sidra API endpoint to add new Entities.

The method used will depend, for example, on having access to the source code of the solution -where the SQL scripts must be placed to be executed- or having to integrate with a third-party tool. In the latter case, the Sidra API is the recommended approach.

Information

Take into account that Sidra metadata tables do not allow for space characters in the names of Attributes or Entities. From version 2022.R2 1.12.2 update, the metadata API will sanitize spaces in characters if they are provided in Entity or Attribute names, by replacing with `_` character. Note that, for adding metadata, the name sanitation will be only done by the API, the SQL scripts will not do it, so it will be required to avoid spaces in Attribute or Entities names by this method.

Add a new Entity by using an SQL script

Add an SQL script following the naming conventions to the Scripts\CoreContext folder or to the place configured in the DatabaseBuilder from where it retrieves the scripts. A sample of the script to add a new Entity can be found below:

-- ENTITY CONFIGURATION

DECLARE @Id_Provider INT = 10
DECLARE @Id_Dsu INT = '{ID DSU}'
DECLARE @EntityItemId UNIQUEIDENTIFIER = NEWID()
DECLARE @CurrentDate DATETIME2 = GETUTCDATE()
DECLARE @EntityName VARCHAR(MAX) = 'MyNewEntity'
DECLARE @Entity_RegularExpression NVARCHAR(MAX) = '^myentity_((?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2}))_((?<hour>\d{2})-(?<minute>\d{2})-(?<second>\d{2})).csv'
DECLARE @EntityFormat VARCHAR(10) = 'csv'
DECLARE @EntityEncoding VARCHAR(50) = 'UTF-8'
DECLARE @EntityDelimiter VARCHAR(6) = '\u0009'
DECLARE @EntityHeaderLines INT = 1
DECLARE @EntityProperties VARCHAR(MAX) = CONCAT('{"sourceTableName": "[staging].[', @EntityName, ']"}')
DECLARE @Entity_ParentSecurityPath VARCHAR(100) = CONCAT(@Id_Dsu, '/', @Id_Provider)

-- ENTITY CREATION

INSERT INTO [DataIngestion].[Entity]
    ([IdProvider], [Name], [TableName], [RegularExpression], 
    [StartValidDate], [EndValidDate], [Serde], [SerdeProperties], 
    [Encoding], [HeaderLines], [FieldDelimiter], [LastUpdated],
    [LastDeployed], [Format], [NullText], [ReCreateTableOnDeployment],
    [RowDelimiter], [FilesPerDrop], [SourcePath], [Description],
    [AdditionalProperties], [IdTableFormat], [GenerateDeltaTable],
    [ParentSecurityPath], [CreationDate], [ItemId])
VALUES
    (@Id_Provider, @EntityName, @EntityName, @Entity_RegularExpression,
    DATEADD(DAY, -1, @CurrentDate), NULL, NULL, NULL,
    @EntityEncoding, @EntityHeaderLines, @EntityDelimiter, GETUTCDATE(),
    NULL, @EntityFormat, NULL, 1,
    NULL, 1, NULL, NULL,
    @EntityProperties, 3, 0,
    @Entity_ParentSecurityPath, GETUTCDATE(), @EntityItemId)

Add a new Entity by using Sidra API

Sidra API requires requests to be authenticated, the section How to use Sidra API explains how to create an authenticated requests. For the rest of the document, it is going to be supposed that Sidra API is deployed in the following URL:

https://core-mycompany-dev-wst-api.azurewebsites.net

This is the sequence of requests required to create a new Entity, some of the request are used to gather information about the Entity, if that information is already available, those requests will not be necessary:

Step 1. Get the identifier of the Provider

Before creating an Entity, it is required to know the Id Provider to which the Entity will be associated. If the identifier is already known, this step can be skipped. If not, this request can be made to retrieve all the Providers:

Request

GET https://core-mycompany-dev-wst-api.azurewebsites.net/api/metadata/providers/all?api-version=1.0

It will return an object with the list of Providers, including their Ids.

Response

[
    {
        "id": 10,
        "itemId": "3fa85f64-5717-4562-b3fc-2c963f66afa6", 
        "providerName": "MyNewProvider",
        "databaseName": "dwmynewprovider",
        "dataStorageUnitName": "MyDSU"
    }
]

From the previous response, the Id of the Provider can be retrived and used in the next step to populate the field idProvider.

Step 2. Create a new Entity

By using the Sidra API some of the fields of the Entity are automatically populated -like Id, ParentSecurityPath or CreationDate-. Once the rest of the information of the entity is gathered, it can be created using this request.

Request

POST https://core-mycompany-dev-wst-api.azurewebsites.net/api/metadata/entities?api-version=1.0

And adding the following content as part of the body.

Request body

{
  "filesPerDrop": 1,
  "rowDelimiter": null,
  "reCreateTableOnDeployment": true,
  "nullText": null,
  "format": "csv",
  "lastDeployed": null,
  "lastUpdated": "2022-06-20T10:00:00.885Z",
  "fieldDelimiter": "\\u0009",
  "headerLines": 1,
  "encoding": "UTF-8",
  "serdeProperties": null,
  "serde": null,
  "endValidDate": null,
  "startValidDate": "2022-06-19T10:00:00.885Z",
  "regularExpression": "^myentity_((?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2}))_((?<hour>\\d{2})-(?<minute>\\d{2})-(?<second>\\d{2})).csv",
  "tableName": "MyNewEntity",
  "name": "MyNewEntity",
  "description": "Description of MyNewEntity",
  "idProvider": 10,
  "additionalProperties": "{\"sourceTableName\":\"[staging].[MyNewEntity]\"}",
  "generateDeltaTable": false,
  "idTableFormat": 3,
  "sourcePath": null
}

The response will return the Id of the Entity created, which can be used, for example, for adding a new Attribute.

Response

{
  "id": 100,
  "tags": null,
  "providerName": "MyNewProvider",
  "databaseName": "dwmynewprovider",
  "dataStorageUnitId": 1,
  "creationDate": "2022-06-20T08:49:48.13",
  "detail": null,
  "entitySize": null,
  "source": ".MyNewEntity",
  "viewDefinition": null,
  "entityType": 1,
  "dsu": null,
  "imageURL": null,
  "filesPerDrop": 1,
  "rowDelimiter": null,
  "reCreateTableOnDeployment": true,
  "nullText": null,
  "format": "csv",
  "lastDeployed": null,
  "lastUpdated": "2022/06/20",
  "fieldDelimiter": "\t",
  "headerLines": 1,
  "encoding": "UTF-8",
  "serdeProperties": null,
  "serde": null,
  "endValidDate": null,
  "startValidDate": "2022/06/19",
  "regularExpression": "^myentity_((?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2}))_((?<hour>\\d{2})-(?<minute>\\d{2})-(?<second>\\d{2})).csv",
  "tableName": "MyNewEntity",
  "name": "MyNewEntity",
  "description": "Description of MyNewEntity",
  "idProvider": 10,
  "additionalProperties": "{\"sourceTableName\":\"[staging].[MyNewEntity]\"}",
  "generateDeltaTable": false,
  "idTableFormat": 3,
  "sourcePath": null
}

Entity fields

This is the information about the Entities that must be included when a new Entity is stored in the metadata database:

Column Description
Id [Required] Entity identifier.
IdProvider [Required] Identifier of the Provider where this Entity will be associated to.
Name [Required] The name of the Entity.
TableName [Required] The name of the table created in Databricks to store this Entity.
RegularExpression [Required] Used to identify what is the Entity associated to a file. For example, if several Providers leave files in the same route of an SFTP server, taking a look at their naming convention the system can identify the Entity.
StartValidDate [Required] Date from which this Entity applies to the file. It could happen that, at any time, if an extra column is added to a file, or this file changes its schema. In order to preserve all previous data that has already been ingested in the system, two different Entities could be registered that are related to the same type of file, but generated in two different moments.
EndValidDate [Optional] Date until this Entity is applicable to the file.
Serde [Optional] If it is necessary to use a custom serializer in Hive to ingest these type of files, it must be specified here.
SerdeProperties [Optional] Properties of the specified serializer.
Encoding [Optional] Specifies the encoding used for the files related to this Entity.
HeaderLines [Required] Specifies the number of header lines of the files related to this Entity.
FieldDelimiter [Optional] The field delimiter used by the files. Can be specified as \uXXXX
LastUpdated [Required] Date in which the file metadata has been updated and needs to be deployed to have the latest changes synchronized with the rest of the system.
LastDeployed [Required] Date in which the file metadata has been deployed.
Format [Optional] The extension of the files.
NullText [Optional] If there is a special string used as a null value, it can be specified here.
ReCreateTableOnDeployment [Required] If it is required to create or recreate the table in the data lake. In case the table exists, it will be overwritten. Once the table has been created, this value is automatically be updated to 0.
RowDelimiter [Optional] The row delimiter used by the files. Can be specified as \uXXXX
FilesPerDrop [Required] Sometimes one type of file can be generated from different sources. In this case, this parameter indicates the number of files of the Entity needed to consider that the data for a day is complete.
SourcePath [Optional] Indicates the origin of the generated files. For example, in case of an SFTP server, this tells the path of the file inside the server.
Description [Optional] Description of this Entity.
AdditionalProperties [Optional] JSON content with additional properties for extensibility.
IdTableFormat [Optional] Id of the format of the table where this Entity is stored (ORC, Parquet and Delta are currently supported).
GenerateDeltaTable [Required] When set to true, an additional table is created where the change produced by each new Asset is traced.
ParentSecurityPath [Optional] The security path of the parent following the metadata model hierarchy.
Image [Optional] Image of the Entity to be displayed in the Data Catalog.
CreationDate [Required] Date of creation of the Entity.
Detail [Optional] Detailed description (makdown allowed) of the Entity to be displayed in the Data Catalog.
ViewDefinition [Optional] Used when the entity is a view instead of a table in the DSU, contains query to the view.
ItemId GUID of the Entity.
SecurityPath [Computed] The security path of the Entity.
IdDataStorageUnit [Required] Id of the Data Storage Unit where this Entity belongs.

Considerations:

  • RegularExpression. Sometimes this field must be agreed with other actors. That means, for instance, if the system is going to load files from an FTP and a third party is going to leave the files in a specific FTP server, DEV team has to discuss with the other team what is going to be the naming convention so that the ingestion process can recognize the file via RegularExpression. This scenario applies for FTP/landing zone file drops. When pulling from data sources such as SQL databases, the DEV team will be in charge of defining this and no interaction with other teams is required.
  • Serde. Here the source format of the file is specified. The value is often "csv" but it could be one of the allowed values for the spark.read.format() function, which, as of today are: json, parquet, jdbc, orc, libsvm, csv, text.
  • SerdeProperties. Depending on which source format the system ingests, the SerdeProperties may vary. To provide more context about this column, the content will be passed to the options function in Databricks. For instance, using the following configuration: Serde = 'csv', SerdeProperties = '"separatorChar" = "\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002"', the outcome in Databricks Spark will be something like: read.format('csv').options(separatorChar"="\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002").
  • Extension. Frequently used when the file is in csv format but compressed in zip or gzip. If the file is something like "myfile.csv.gz" then, Extension will be "gz".


Sidra Ideas Portal


Last update: 2022-07-14
Back to top