How to add a new attribute

An attribute represents one of the fields or columns of an entity. So, after the creation of a new entity, the next step is to create new attributes for that entity, one for each of the fields/columns of the asset that is represented by the entity. If the entity is not created yet, this tutorial How to add a new entity can be used to create it.

Attributes provide details about the structure of the asset and they will be used, along provider and entity, to generate pipelines, transfer queries, data lake table creations, etc. They are stored in the Attribute table in the metadata database.

As it happens with other members of the metadata model. There are two ways to create a new attribute:

  • Create a SQL script to insert the new attribute in the database.
  • Use the Sidra API endpoint to add new attributes.

Attribute information

This is the information about an attribute that must be included when it is added to the metadata database:

Column Description
Id [Required] Attribute identifier
IdEntity [Required] The identifier of the related entity
Name [Required] Name of the column
HiveType [Required] The type of the column. It will be used to generate the Hive table
MaxLen [Optional] Maximum size of the column. This applies to columns with STRING type
IsNullable [Required] Tells if the column can have null values or not
NeedTrim [Required] Tells if the column's need to be trimmed
RemoveQuotes [Required] Tells if it is necessary to remove quotes from the column
ReplacedText [Optional] Indicates a text in the column to be replaced
ReplacementText [Optional] The text to replace the ReplacedText with
SpecialFormat [Optional] If the column need to pass a validation that cannot be enforced using the rest of the columns, it is specified here. For example, to validate a date's format
TreatEmptyAsNull [Required] Tells if empty values should be considered nulls
IsPrimaryKey [Required] Used when the IsMerged in entity is true to determine what data is new
Order [Required] The order of the column inside the file
IsCalculated [Required] Tells if the column is calculated, and not a column inside the file
IsPartitionColumn [Required] Tells if the column is used in the table partition or not
IsMetadata [Required] Tells if the column is metadata for the system, and not data of the file itself
SQLType [Optional] Indicates the type of the column to create a SQL table
ValidationText [Optional] Defines specific validations to be performed for this data instead of using the validations generated by the GenerateTransferQuery custom activity.
Description [Optional] Description of this attribute
SourceName [Optional] Name of the column in the data source. It is used in case it is different to the name used in Sidra
ParentSecurityPath [Optional] The security path of the parent following the metadata model hierarchy
IsEncrypted [Required] Tells if the column is encrypted

Considerations:

  • HiveType, MaxLen, IsNullable are basically used for validations. The Hive data types allowed to be set in the columns can be found in the Hive documentation. The use of IsNullable must be considered carefully since it is a common source of validation errors. It is recommended to set it to true when actually the column must be always not nullable. The best practice for nullable values in columns is to treat them in the Client apps.
  • SQLType: Many Client apps end up storing data extracted from the Data Lake in an Azure SQL Database. In order to auto-generate the tables to hold that data, this column contains specific SQL Server types such as decimal with scale and precision or datetime2.
  • IsPartitionColumn: As it has been described in previous sections, data is stored in Data Lake in a optimized way. One of these ways is by partitioning so extraction can be more accurate and faster. Setting this column to true will make the table created in Data Lake to be partitioned by this column. It is important to point that, if there are 2 columns acting as partition column, the order will impact on how the table is partitioned. That is, in the example below partitions will be: FileDate then IdSourceItem. It is important to make sure that columns are in the correct order.
  • IsPrimaryKey: Only used for "merged" entities, quite often set to false for all the attributes.
  • When a value for ValidationText is provided, any other validation is overriden and the only one applied is the ValidationText.
  • IsEncrypted can be used to encrypt the values for this attribute in the Data Lake.

Sidra system attributes

There are five attributes that are added to the original set of specific attributes for the entity. They are used by Sidra for multiple purposes and it is mandatory to have them. Those attributes are: SourceByteOffset, LoadDate , HasErrors, FileDate and IdSourceItem.

By default, Sidra partitions the Data Lake tables by FileDate and IdSourceItem but it is up to the team configuring the system to choose which partitions to use. Sidra team strongly encourage to use these two attributes or, at least, IdSourceItem which is widely used by the platform and impacts on its performance.

Add attribute using a SQL script

Add a SQL script following the naming conventions to the Scripts\CoreContext folder or to the place configured in the DatabaseBuilder from where it retrieves the scripts. A sample of the script to add a new attribute can be found below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
-- DECLARE
DECLARE @Id_Entity INT = 100 -- Id of the entity to which the attribute belongs 

DECLARE @SecurityPath_Entity VARCHAR(100) =
(
    SELECT TOP(1) [SecurityPath]  
    FROM [DataIngestion].[Entity] 
    WHERE [Id] = @Id_Entity
)
DECLARE @Id_Attribute INT = 
(
    SELECT COALESCE(MAX([Id]) +1, 1)
    FROM [DataIngestion].[Attribute]    
)
DECLARE @Order_Attribute SMALLINT =
(
    SELECT COALESCE(MAX([Order]) +1, 1)
    FROM [DataIngestion].[Attribute]
    WHERE [IdEntity] = @Id_Entity   
)

-- ROLLBACK
DELETE FROM [DataIngestion].[Attribute]
WHERE [Id] BETWEEN @Id_Attribute AND @Id_Attribute+1 -- Update the number of attributes to add

-- SCRIPT
SET IDENTITY_INSERT [DataIngestion].[Attribute] ON
INSERT [DataIngestion].[Attribute] 
     ([Id],             [IdEntity], [Name],     [HiveType], [MaxLen], [IsNullable], [NeedTrim], [RemoveQuotes], [ReplacedText], [ReplacementText], [SpecialFormat], [TreatEmptyAsNull], [IsPrimaryKey], [IsCalculated], [Order],             [IsPartitionColumn], [IsMetadata], [ValidationText], [SQLType],   [ParentSecurityPath], [IsEncrypted]) 
VALUES 
     (@Id_Attribute,    @Id_Entity, N'ColumnA', N'INT',     NULL,      1,            0,          0,              NULL,           NULL,              NULL,           0,                  0,              0,              @Order_Attribute,    0,                   0,            NULL,             N'INT',      @SecurityPath_Entity, 0)
    ,(@Id_Attribute+1,  @Id_Entity, N'ColumnB', N'STRING',  1000,      1,            0,          0,              NULL,           NULL,              NULL,           0,                  0,              0,              @Order_Attribute+1,  0,                   0,            NULL,             N'NVARCHAR', @SecurityPath_Entity, 0) 
SET IDENTITY_INSERT [DataIngestion].[Attribute] OFF

Adding Sidra system attributes

In the most common scenarios, the script below can be used to add the Sidra system attributes. It is just required to setup the @Id_Entity to the appropriate value.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
-- DECLARE
DECLARE @Id_Entity INT = 100 -- Id of the entity to which the attribute belongs 

DECLARE @SecurityPath_Entity VARCHAR(100) =
(
    SELECT TOP(1) [SecurityPath]  
    FROM [DataIngestion].[Entity] 
    WHERE [Id] = @Id_Entity
)
DECLARE @Id_Attribute INT = 
(
    SELECT COALESCE(MAX([Id]) +1, 1)
    FROM [DataIngestion].[Attribute]    
)
DECLARE @Order_Attribute SMALLINT =
(
    SELECT COALESCE(MAX([Order]) +1, 1)
    FROM [DataIngestion].[Attribute]
    WHERE [IdEntity] = @Id_Entity   
)

-- ROLLBACK
DELETE FROM [DataIngestion].[Attribute]
WHERE [Id] BETWEEN @Id_Attribute AND @Id_Attribute+4

-- SCRIPT
SET IDENTITY_INSERT [DataIngestion].[Attribute] ON
INSERT [DataIngestion].[Attribute] 
     ([Id],             [IdEntity], [Name],              [HiveType], [MaxLen], [IsNullable], [NeedTrim], [RemoveQuotes], [ReplacedText], [ReplacementText], [SpecialFormat],                    [TreatEmptyAsNull], [IsPrimaryKey], [IsCalculated], [Order],            [IsPartitionColumn], [IsMetadata], [ValidationText], [SQLType],        [ParentSecurityPath], [IsEncrypted]) 
VALUES 
     (@Id_Attribute,    @Id_Entity, N'SourceByteOffset', N'INT',     NULL,     0,            0,          0,              NULL,           NULL,              N'BLOCK__OFFSET__INSIDE__FILE',     0,                  0,              1,              @Order_Attribute,   0,                    1,            NULL,             N'INT',          @SecurityPath_Entity, 0)
    ,(@Id_Attribute+1,  @Id_Entity, N'LoadDate',         N'STRING',  NULL,     0,            0,          0,              NULL,           NULL,              N'FROM_UNIXTIME(UNIX_TIMESTAMP())', 0,                  0,              1,              @Order_Attribute+1, 0,                    1,            NULL,             N'DATETIME2',    @SecurityPath_Entity, 0)
    ,(@Id_Attribute+2,  @Id_Entity, N'HasErrors',        N'BOOLEAN', NULL,     0,            0,          0,              NULL,           NULL,              N'FALSE',                           0,                  0,              1,              @Order_Attribute+2, 0,                    1,            NULL,             N'BIT',          @SecurityPath_Entity, 0)
    ,(@Id_Attribute+3,  @Id_Entity, N'FileDate',         N'DATE',    NULL,     0,            0,          0,              NULL,           NULL,              N'''${hiveconf:Date}''',            0,                  0,              1,              @Order_Attribute+3, 1,                    1,            NULL,             N'DATE',         @SecurityPath_Entity, 0)
    ,(@Id_Attribute+4,  @Id_Entity, N'IdSourceItem',     N'INT',     NULL,     0,            0,          0,              NULL,           NULL,              N'IdSourceItem',                    0,                  0,              1,              @Order_Attribute+4, 1,                    1,            NULL,             N'INT',          @SecurityPath_Entity, 0)
SET IDENTITY_INSERT [DataIngestion].[Attribute] OFF

Add attribute using Sidra API

Sidra API requires requests to be authenticated, the section How to use Sidra API explains how to create an authenticated requests. For the rest of the document, it is going to be supposed that Sidra API is deployed in the following URL:

1
https://core-mycompany-dev-wst-api.azurewebsites.net

Before creating an attribute, it is required to know the Id of the entity to which the attribute belongs.

Create a new attribute

By using the Sidra API some of the fields of the attribute are automatically populated -like Id or ParentSecurityPath-. Once the rest of the information of the attribute is gathered, it can be created using the following request:

Request

1
POST https://core-mycompany-dev-wst-api.azurewebsites.net/api/metadata/attributes?api-version=1.0

And adding the following content as part of the body.

Request body

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
{
    "sqlType": "INT",
    "isMetadata": false,
    "isPartitionColumn": false,
    "isCalculated": false,
    "order": 1,
    "isPrimaryKey": false,
    "treatEmptyAsNull": false,
    "removeQuotes": false,
    "needTrim": false,
    "isNullable": true,
    "hiveType": "INT",
    "name": "ColumnA",
    "idEntity": 100
}

The response will return the Id of the Attribute created, which can be used, for example, for adding a new attribute format. A request must be done for each of the attributes that want to be created, since, currently, the Sidra API does not support creating attributes in bulk.

Response

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
{
  "id": 1000,
  "sqlType": "string",
  "isMetadata": true,
  "isPartitionColumn": true,
  "isCalculated": true,
  "order": 0,
  "isPrimaryKey": true,
  "treatEmptyAsNull": true,
  "specialFormat": "string",
  "replacementText": "string",
  "replacedText": "string",
  "removeQuotes": true,
  "needTrim": true,
  "isNullable": true,
  "maxLen": 0,
  "hiveType": "string",
  "name": "string",
  "idEntity": 0,
  "validationText": "string",
  "description": "string",
  "sourceName": "string"
}

Adding Sidra system attributes

In order to add the system attributes, the same request can be used but using the following content in the body.

NOTE

The fields idEntity and order must be properly configured

Request body

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
{
    "sqlType": "INT",
    "isMetadata": true,
    "isPartitionColumn": false,
    "isCalculated": true,
    "order": 2,
    "isPrimaryKey": false,
    "treatEmptyAsNull": false,
    "removeQuotes": false,
    "needTrim": false,
    "isNullable": false,
    "hiveType": "INT",
    "name": "SourceByteOffset",
    "specialFormat": "BLOCK__OFFSET__INSIDE__FILE",
    "idEntity": 100
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
{
    "sqlType": "DATETIME2",
    "isMetadata": true,
    "isPartitionColumn": false,
    "isCalculated": true,
    "order": 3,
    "isPrimaryKey": false,
    "treatEmptyAsNull": false,
    "removeQuotes": false,
    "needTrim": false,
    "isNullable": false,
    "hiveType": "STRING",
    "name": "LoadDate",
    "specialFormat": "FROM_UNIXTIME(UNIX_TIMESTAMP())",
    "idEntity": 100
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
{
    "sqlType": "BIT",
    "isMetadata": true,
    "isPartitionColumn": false,
    "isCalculated": true,
    "order": 4,
    "isPrimaryKey": false,
    "treatEmptyAsNull": false,
    "removeQuotes": false,
    "needTrim": false,
    "isNullable": false,
    "hiveType": "BOOLEAN",
    "name": "HasErrors",
    "specialFormat": "FALSE",
    "idEntity": 100
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
{
  "sqlType": "DATE",
  "isMetadata": true,
  "isPartitionColumn": true,
  "isCalculated": true,
  "order": 5,
  "isPrimaryKey": false,
  "treatEmptyAsNull": false,
  "removeQuotes": false,
  "needTrim": false,
  "isNullable": false,
  "hiveType": "DATE",
  "name": "FileDate",
  "specialFormat": "'${hiveconf:Date}'",
  "idEntity": 100
}
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
{
  "sqlType": "INT",
  "isMetadata": true,
  "isPartitionColumn": true,
  "isCalculated": true,
  "order": 6,
  "isPrimaryKey": false,
  "treatEmptyAsNull": false,
  "removeQuotes": false,
  "needTrim": false,
  "isNullable": false,
  "hiveType": "INT",
  "name": "IdSourceItem",
  "specialFormat": "IdSourceItem",
  "idEntity": 100
}