How to add a new attribute format¶
When reading a file to be ingested in the Data Lake, some of the columns of the file could contain values that must be interpreted as a particular type but when handling this data with Hive, it could not be interpreted correctly. For example, a column of the file is expected to be a Boolean type and Hive expects for a Boolean value to be
FALSE (uppercase), however, the values coming in the column are 1 or 0 (for true and false, respectively). Those values must be formatted so Hive can properly understand them. AttributeFormat handles that kind of situations. In this previous case, by adding an AttributeFormat that transforms the 0 into a
FALSE and another AttributeFormat to transform the 1 into a
The AttributeFormat is a specification about how to transform values of an Attribute when ingesting the data in Data Lake. AttributeFormat is not intended to be used for implementing business logic. It is addressed to perform technical transformations for data types such as boolean, nonstandard date formats, decimal separators, etc.
This is the information about an Atribute format that must be included when it is added to the metadata database:
|Id||[Required] Identifier of the AttributeFormat|
|IdAttribute||[Required] Identifier of the Attribute to be handled|
|SourceValue||[Optional] The value that comes in the column that should be interpreted. In the previous example, the 1 and 0 values.|
|RegExp||[Optional] Regular expression that, if found in the column, should be replaced.|
|HQLExpression||[Optional] The value that is expected on Hive. In the previous example, the
|LookupExpression||[Optional] Expression used in the lookup. It is only supported to have one single -as max- AttributeFormat with LookupExpression defined per Attribute.|
|ParentSecurityPath||[Optional] The security path of the parent following the metadata model hierarchy|
RegExpis similar to
SourceValuebut instead of defining the specific value in the column that must be formatted, it defines a regular expression. When the regular expression matches a value in the column, it means that the value must be formatted. If a value is defined in
RegExpwill be ignored.
Add attribute format using a SQL script¶
Add a SQL script following the naming conventions to the
Scripts\CoreContext folder or to the place configured in the DatabaseBuilder from where it retrieves the scripts. A sample of the script to add a new attribute format can be found below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
Add Attribute format using Sidra API¶
Sidra API requires requests to be authenticated, the section How to use Sidra API explains how to create an authenticated requests. For the rest of the document, it is going to be supposed that Sidra API is deployed in the following URL:
Before creating an Attribute format, it is required to know the Id of the Attribute to which it belongs.
Create a new Attribute format¶
By using the Sidra API some of the fields of the Attribute format are automatically populated -like
ParentSecurityPath-. Once the rest of the information of the Attribute format is gathered, it can be created using the following request:
And adding the following content as part of the body.
1 2 3 4 5
The response will return the Id of the Attribute format created. A request must be done for each of the Attribute format that want to be created, since, currently, the Sidra API does not support creating Attribute formats in bulk.
1 2 3 4 5 6 7 8