How to add a new entity

This tutorial explains how to to add a new entity in Sidra, this is required for example when a new type of file has to be loaded from a FTP server or new entities are going to be pulled from an API. When adding a new entity, there are two possible scenarios, the entity belongs to an existing provider or a new provider must be configured. If the provider is not created yet, this tutorial How to add a new provider can be used to create it. Entities are stored in the Entity table in the metadata database.

There are two ways to create a new entity:

  • Create a SQL script to insert the new entity in the database.
  • Use the Sidra API endpoint to add new entities.

The method used will depend, for example, on having access to the source code of the solution -where the SQL scripts must be placed to be executed- or having to integrate with a third-party tool. In the latter case, the Sidra API is the approach recommended.

Entity information

This is the information about the entities that must be included when a new entity is stored in the metadata database:

Column Description
Id [Required] Entity identifier
IdProvider [Required] Identifier of the provider that generates this type of data
Name [Required] The name of the entity
TableName [Required] The name of the table created in Databricks to store this entity
RegularExpression [Required] Used to identify what is a file's entity. For example, if several providers leave files in the same route of an SFTP server, taking a look at their naming convention the system can identify the entity
StartValidDate [Required] Date from which this entity applies to the file. It could happen that, at any time, a file is added one extra column, or it changes it schema. In order to preserve all previous data that has already been ingested in the system, there could be registered two entities that are related to the same type of file, but generated in two different moments
EndValidDate [Optional] Date until this entity applies to the file
Serde [Optional] If it is necessary to use a custom serializer in Hive to ingest these type of files, it must be specified here
SerdeProperties [Optional] Properties of the specified serializer
Encoding [Optional] Specifies the encoding used for the files related to this entity
HeaderLines [Required] Specifies the number of header lines of the files related to this entity
FieldDelimiter [Optional] The field delimiter used by the files. Can be specified as \uXXXX
IsMerged [Required] If sets to true, only new data will be inserted in the partition
LastUpdated [Required] Date in which the file metadata has been updated and needs to be deployed to have the latest changes synchronized with the rest of the system
Format [Optional] The extension of the files
NullText [Optional] If there is a special string used as a null value, it can be specified here
ReCreateTableOnDeployment [Required] If it is needed to create or recreate the table in the data lake. In case the table exists, it will be overwritten. Once the table has been created, this value is automatically updated to 0
RowDelimiter [Optional] The row delimiter used by the files. Can be specified as \uXXXX
FilesPerDrop [Required] Sometimes one type of file can be generated from different sources. In this case, this parameter indicates the number of files of the entity needed to consider that the data for a day is complete
SourcePath [Optional] Indicates the origin of the generated files. For example, in case of an SFTP server, this tells the path of the file inside it
Description [Optional] Description of this entity
AdditionalProperties [Optional] Json content with additional properties for extensibility
IdTableFormat [Optional] Id of the format of the table where this entity is stored (ORC, Parquet and Delta are currently supported)
GenerateDeltaTable [Required] When true, an additional table is created where the change produced by each new asset are traced
ParentSecurityPath [Optional] The security path of the parent following the metadata model hierarchy
CreationDate [Required] Date of creation of the entity

Considerations:

  • RegularExpression. Sometimes must be agreed with other actors. That means, for instance, if the system is going to load files from a FTP and a third party is going to leave the files in a specific FTP server, DEV team has to discuss with the other team what is going to be the naming convention for the so that the importing process can recognize the file via RegularExpression. This scenario applies for FTP/landing zone file drops. When pulling from data sources such as SQL databases, the DEV team will be in charge of defining this and no interaction with other teams is required.
  • Serde. Here it is specified the source format of the file. The value is often "csv" but it could be one of the allowed values for the spark.read.format() function, which, as of today are: json, parquet, jdbc, orc, libsvm, csv, text.
  • SerdeProperties. Depending on which source format the system ingests, the SerdeProperties may vary. To provide more context about this column, the content of it will be passed to the options function in Databricks. For instance, using the following configuration: Serde = 'csv', SerdeProperties = '"separatorChar" = "\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002"', the outcome in Databricks Spark will be something like: read.format('csv').options(separatorChar"="\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002").
  • Extension. Frequently used when the file is in csv format but compressed in zip or gzip. If the file is something like "myfile.csv.gz" then, Extension will be "gz".

Add entity using a SQL script

Add a SQL script following the naming conventions to the Scripts\CoreContext folder or to the place configured in the DatabaseBuilder from where it retrieves the scripts. A sample of the script to add a new entity can be found below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
-- DECLARE
DECLARE @Id_Provider INT = 10 -- Id of the provider to which the entity belongs

DECLARE @Id_Entity INT = 
(
    SELECT COALESCE(MAX([Id]) +1, 1)
    FROM [DataIngestion].[Entity]   
)

DECLARE @SecurityPath_Provider VARCHAR(100) =
( 
    SELECT TOP(1) [SecurityPath]  
    FROM [DataIngestion].[Provider] 
    WHERE [Id] = @Id_Provider
)

-- ROLLBACK
DELETE FROM [DataIngestion].[Entity]
WHERE [Id] = @Id_Entity

-- SCRIPT
SET IDENTITY_INSERT [DataIngestion].[Entity] ON
INSERT [DataIngestion].[Entity] (
    [Id], 
    [IdProvider], 
    [Name], 
    [TableName], 
    [RegularExpression],
    [StartValidDate],
    [EndValidDate],  
    [Serde],
    [SerdeProperties], 
    [Encoding], 
    [HeaderLines], 
    [FieldDelimiter], 
    [IsMerged],
    [LastUpdated],  
    [Format],
    [NullText],
    [ReCreateTableOnDeployment],
    [RowDelimiter],
    [FilesPerDrop],
    [GenerateDeltaTable],
    [ParentSecurityPath],
    [CreationDate])
VALUES (
    @Id_Entity, 
    @Id_Provider,
    N'MyEntityName',
    N'MyEntityTableName',
    N'^myentity-((?<year>\d{4})(?<month>\d{2})(?<day>\d{2}))\.csv', 
    '20191120', 
    NULL,    
    N'csv', 
    N'"separatorChar" = "\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002"', 
    N'UTF-8', 
    0, 
    N'\u0001',
    0, 
    GETUTCDATE(),
    N'csv', 
    N'\\N', 
    0, 
    NULL,
    1,
    0,
    @SecurityPath_Provider,
    GETUTCDATE())
SET IDENTITY_INSERT [DataIngestion].[Entity] OFF

Add entity using Sidra API

Sidra API requires requests to be authenticated, the section How to use Sidra API explains how to create an authenticated requests. For the rest of the document, it is going to be supposed that Sidra API is deployed in the following URL:

1
https://core-mycompany-dev-wst-api.azurewebsites.net

This is the sequence of requests requiered to create a new entity, some of the request are used to gather information about the entity, if that information is already available, those requests will not be necessary.

Step 1. Get the identifier of the provider

Before creating an entity, it is required to know the Id provider to which the entity will be associated. If the identifier is already known, this step can be skipped. If not, this request can be made to retrieve all the providers.

Request

1
GET https://core-mycompany-dev-wst-api.azurewebsites.net/api/metadata/providers/all?api-version=1.0

It will return an object with the list of providers, including their Ids.

Response

1
2
3
4
5
6
7
8
[
    {
        "id": 10,
        "providerName": "MyNewProvider",
        "databaseName": "dw_newprovider",
        "datalakeName": "MyDSU"
    }
]

From the previous response, the Id of the provider can be retrived and used in the next step to populate the field idProvider.

Step 2. Create a new entity

By using the Sidra API some of the fields of the entity are automatically populated -like Id, ParentSecurityPath or CreationDate-. Once the rest of the information of the entity is gathered, it can be created using this request.

Request

1
POST https://core-mycompany-dev-wst-api.azurewebsites.net/api/metadata/entities?api-version=1.0

And adding the following content as part of the body.

Request body

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
{
    "filesPerDrop": 1,
    "reCreateTableOnDeployment": false,
    "nullText": "\\N",
    "format": "csv",
    "lastUpdated": "2020-01-23T11:09:17.885Z",
    "isMerged": false,
    "fieldDelimiter": "\\u0001",
    "headerLines": 0,
    "encoding": "UTF-8",
    "serdeProperties": "separatorChar = \"\\u002c\",escapeChar= \"\\u0002\"",
    "serde": "csv",
    "startValidDate": "2020-01-23T11:09:17.885Z",
    "regularExpression": "^myentity-((?<year>\\d{4})(?<month>\\d{2})(?<day>\\d{2}))\\.csv",
    "tableName": "MyEntityTableName",
    "name": "MyEntityName",
    "idProvider": 10
}

The response will return the Id of the Entity created, which can be used, for example, for adding a new attribute.

Response

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
{
    "id": 100,
    "providerName": "MyNewProvider",
    "dataLakeId": 1,
    "creationDate": "2020-01-23T11:31:26.267Z",
    "filesPerDrop": 1,
    "rowDelimiter": null,
    "reCreateTableOnDeployment": false,
    "nullText": "\\N",
    "format": "csv",
    "lastDeployed": null,
    "lastUpdated": "2020-01-23T11:09:17.885Z",
    "isMerged": false,
    "fieldDelimiter": "\\u0001",
    "headerLines": 0,
    "encoding": "UTF-8",
    "serdeProperties": "separatorChar = \"\\u002c\",escapeChar= \"\\u0002\"",
    "serde": "csv",
    "endValidDate": null,
    "startValidDate": "2020-01-23T11:09:17.885Z",
    "regularExpression": "^myentity-((?<year>\\d{4})(?<month>\\d{2})(?<day>\\d{2}))\\.csv",
    "tableName": "MyEntityTableName",
    "name": "MyEntityName",
    "description": null,
    "idProvider": 10,
    "additionalProperties": null
}