Skip to content

How to add a new Attribute format

When reading a file to be ingested in the Data Lake, some of the columns of the file could contain values that must be interpreted as a particular type but when handling this data with Hive, it could not be interpreted correctly. For example, a column of the file is expected to be a Boolean type and Hive expects for a Boolean value to be TRUE or FALSE (uppercase), however, the values coming in the column are 1 or 0 (for true and false, respectively). Those values must be formatted so Hive can properly understand them. AttributeFormat handles that kind of situations. In this previous case, by adding an AttributeFormat that transforms the 0 into a FALSE and another AttributeFormat to transform the 1 into a TRUE.

The AttributeFormat is a specification about how to transform values of an Attribute when ingesting the data in Data Lake. AttributeFormat is not intended to be used for implementing business logic. It is addressed to perform technical transformations for data types such as boolean, nonstandard date formats, decimal separators, etc.

Add attribute format using a SQL script

Add a SQL script following the naming conventions to the Scripts\CoreContext folder or to the place configured in the DatabaseBuilder from where it retrieves the scripts. A sample of the script to add a new Attribute format can be found below:

--  SPECIFY THE ATTRIBUTE FOR WHICH WE'RE ADDING FORMATS

DECLARE @Provider_Name VARCHAR(30) = '<Provider_Name, VARCHAR(30), MyProviderName>';
DECLARE @Entity_Name VARCHAR(30) = '<Entity_Name, VARCHAR(256), MyEntityName>';
DECLARE @Attribute_Name VARCHAR(30) = '<Attribute_Name, VARCHAR(50), MyAttributeName>';

--  LET'S PINPOINT THE ATTRIBUTE WE'RE CONFIGURING WITH FORMATS

DECLARE @ProviderID INT =
    (
        SELECT TOP (1) [Id]
            FROM [DataIngestion].[Provider]
            WHERE [ProviderName] = @Provider_Name
    );
DECLARE @Entity_ID INT = 
    (
        SELECT TOP (1) [Id]
            FROM [DataIngestion].[Entity]
            WHERE [Name] = @Entity_Name
                AND [IdProvider] = @ProviderID
    );
DECLARE @Attribute_ID INT = 
    (
        SELECT TOP (1) [Id]
            FROM [DataIngestion].[Attribute]
            WHERE [Name] = @Attribute_Name
                AND [IdEntity] = @Entity_ID
    );
DECLARE @Attribute_SecurityPath VARCHAR(100) =
    (
       SELECT TOP(1) [SecurityPath]  
            FROM [DataIngestion].[Attribute] 
            WHERE [Id] = @Attribute_ID
    );

--  ADD ATTRIBUTE FORMAT RECORDS

INSERT INTO [DataIngestion].[AttributeFormat] 
        (
            [IdAttribute],
            [SourceValue],
            [RegExp],
            [HQLExpression],
            [ParentSecurityPath]
        )
    VALUES
        (
            @Attribute_ID,             --  [IdAttribute]
            N'0',                      --  [SourceValue]
            NULL,                      --  [RegExp]
            N'FALSE',                  --  [HQLExpression]
            @Attribute_SecurityPath    --  [ParentSecurityPath]
        )
        ,
        (
            @Attribute_ID,             --  [IdAttribute]
            N'1',                      --  [SourceValue]
            NULL,                      --  [RegExp]
            N'TRUE',                   --  [HQLExpression]
            @Attribute_SecurityPath    --  [ParentSecurityPath]
        );

Add Attribute format using Sidra API

Sidra API requires requests to be authenticated, the section How to use Sidra API explains how to create an authenticated requests. For the rest of the document, it is going to be supposed that Sidra API is deployed in the following URL:

https://core-mycompany-dev-wst-api.azurewebsites.net

Before creating an Attribute format, it is required to know the Id of the Attribute to which it belongs.

Create a new Attribute format

By using the Sidra API some of the fields of the Attribute format are automatically populated -like Id or ParentSecurityPath-. Once the rest of the information of the Attribute format is gathered, it can be created using the following request:

Request

POST https://core-mycompany-dev-wst-api.azurewebsites.net/api/metadata/attributeformats?api-version=1.0

And adding the following content as part of the body.

Request body

{
  "idAttribute": 1000,
  "sourceValue": "0",
  "regExp": null,
  "hqlExpression": "FALSE",
  "lookupExpression": null
}

The response will return the Id of the Attribute format created. A request must be done for each of the Attribute format that want to be created, since, currently, the Sidra API does not support creating Attribute formats in bulk.

Response

{
  "id": 10,
  "idAttribute": 1000,
  "sourceValue": "0",
  "regExp": null,
  "hqlExpression": "FALSE",
  "lookupExpression": null
}

AttributeFormat fields

This is the information about an Atribute format that must be included when it is added to the metadata database:

Column Description
Id [Required] Identifier of the AttributeFormat
IdAttribute [Required] Identifier of the Attribute to be handled
SourceValue The value that comes in the column that should be interpreted. This field represents the source value to transform. MaximumLength(50)
RegExp Regular expression that, if found in the column, should be replaced. The expression to transform. It is applied only if SourceValue is null. This should be a valid RegEx expression. MaximumLength(100).
HQLExpression The value that is expected on Hive. The Hive expression used to replace values from SourceValue/RegExp to the one defined in this expression. This should be a valid HQL expression. NotEmpty, MaximumLength(512).
LookupExpression [Optional] Expression used in the lookup. It is only supported to have one single -as max- AttributeFormat with LookupExpression defined per Attribute.
ParentSecurityPath [Optional] The security path of the parent following the metadata model hierarchy
ItemId ItemId for the AttributeFormat record
SecurityPath [Optional] The security path of the Attribute following the metadata model hierarchy
IdDataStorageUnit Id of the Data Storage Unit
IdEntity Id of the parent Entity
IdProvicer Id of the parent Provider

Considerations:

  • The RegExp is similar to SourceValue but instead of defining the specific value in the column that must be formatted, it defines a regular expression. When the regular expression matches a value in the column, it means that the value must be formatted. If a value is defined in SourceValue, the RegExp will be ignored.
  • The LookupExpression field is an optional expression to use lookups with the SourceName in the Attribute. The format is the following: [tableName];[columnToUseInJoinComparison];[columnToUse] For example, let us supose that the SourceName is userid and Name is Name for a given Attribute, assuming a table users where the username should be retrieved. In this case the Users table contains Id and Username as columns. To map the userId with the username,the expression will be: users;id;username And the result will be similar to the following in the transferquery: SELECT ... t1.username AS Name ... FROM staging s JOIN dw_database.users t1 ON s.userid = t.id


Sidra Ideas Portal


Last update: 2022-07-14
Back to top