How to add a new attribute format

When reading a file to be ingested in the Data Lake, some of the columns of the file could contain values that must be interpreted as a particular type but when handling this data with Hive, it could not be interpreted correctly. For example, a column of the file is expected to be a Boolean type and Hive expects for a Boolean value to be TRUE or FALSE (uppercase), however, the values coming in the column are 1 or 0 (for true and false, respectively). Those values must be formatted so Hive can properly understand them. AttributeFormat handles that kind of situations. In this previous case, by adding an AttributeFormat that transforms the 0 into a FALSE and another AttributeFormat to transform the 1 into a TRUE.

The AttributeFormat is a specification about how to transform values of an Attribute when ingesting the data in Data Lake. AttributeFormat is not intended to be used for implementing business logic. It is addressed to perform technical transformations for data types such as boolean, nonstandard date formats, decimal separators, etc.

AttributeFormat information

This is the information about an attribute format that must be included when it is added to the metadata database:

Column Description
Id [Required] Identifier of the AttributeFormat
IdAttribute [Required] Identifier of the Attribute to be handled
SourceValue [Optional] The value that comes in the column that should be interpreted. In the previous example, the 1 and 0 values.
RegExp [Optional] Regular expression that, if found in the column, should be replaced.
HQLExpression [Optional] The value that is expected on Hive. In the previous example, the TRUE and FALSEvalues.
LookupExpression [Optional] Expression used in the lookup. It is only supported to have one single -as max- AttributeFormat with LookupExpression defined per Attribute.
ParentSecurityPath [Optional] The security path of the parent following the metadata model hierarchy

Considerations:

  • The RegExp is similar to SourceValue but instead of defining the specific value in the column that must be formatted, it defines a regular expression. When the regular expression matches a value in the column, it means that the value must be formatted. If a value is defined in SourceValue, the RegExp will be ignored.

Add attribute format using a SQL script

Add a SQL script following the naming conventions to the Scripts\CoreContext folder or to the place configured in the DatabaseBuilder from where it retrieves the scripts. A sample of the script to add a new attribute format can be found below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
-- DECLARE
DECLARE @Id_Attribute INT = 1000 -- Id of the attribute to which the attribute format belongs 

DECLARE @SecurityPath_Attribute VARCHAR(100) =
(
    SELECT TOP(1) [SecurityPath]  
    FROM [DataIngestion].[Attribute] 
    WHERE [Id] = @Id_Attribute
)
DECLARE @Id_AttributeFormat INT = 
(
    SELECT COALESCE(MAX([Id]) +1, 1)
    FROM [DataIngestion].[AttributeFormat]  
)

-- ROLLBACK
DELETE FROM [DataIngestion].[AttributeFormat]
WHERE [Id] BETWEEN @Id_AttributeFormat AND @Id_AttributeFormat+1

SET IDENTITY_INSERT [DataIngestion].[AttributeFormat] ON
INSERT [DataIngestion].[AttributeFormat] 
    ([Id],                   [IdAttribute], [SourceValue], [RegExp], [HQLExpression], [ParentSecurityPath])
VALUES
     (@Id_AttributeFormat,   @Id_Attribute, N'0',          NULL,     N'FALSE',        @SecurityPath_Attribute)
    ,(@Id_AttributeFormat+1, @Id_Attribute, N'1',          NULL,     N'TRUE',         @SecurityPath_Attribute)

SET IDENTITY_INSERT [DataIngestion].[AttributeFormat] OFF

Add attribute format using Sidra API

Sidra API requires requests to be authenticated, the section How to use Sidra API explains how to create an authenticated requests. For the rest of the document, it is going to be supposed that Sidra API is deployed in the following URL:

1
https://core-mycompany-dev-wst-api.azurewebsites.net

Before creating an attribute format, it is required to know the Id of the attribute to which it belongs.

Create a new attribute format

By using the Sidra API some of the fields of the attribute format are automatically populated -like Id or ParentSecurityPath-. Once the rest of the information of the attribute format is gathered, it can be created using the following request:

Request

1
POST https://core-mycompany-dev-wst-api.azurewebsites.net/api/metadata/attributeformats?api-version=1.0

And adding the following content as part of the body.

Request body

1
2
3
4
5
{
    "idAttribute": 1000,
    "sourceValue": "0",
    "hqlExpression": "FALSE"
}

The response will return the Id of the attribute format created. A request must be done for each of the attribute format that want to be created, since, currently, the Sidra API does not support creating attribute formats in bulk.

Response

1
2
3
4
5
6
7
8
{
    "id": 10000,
    "idAttribute": 1000,
    "sourceValue": "0",
    "regExp": null,
    "hqlExpression": "FALSE",
    "lookupExpression": null
}