Assets metadata

All the data that will be ingested in the Data Lake must be extracted from the data source and stored in files in the platform. Those files are the "raw storage" of the information ingested in the Data Lake and must be in one of the supported file types, currently CSV and Parquet. The reason of using a specific file types is in order to automatize the ingestion process from the raw storage into the optimized storage of the Data Lake. That means that if the information is stored in any other type of file -such as an XML or JSON- in the data source, it must be previously converted to one of the supported file types.

To transfer the data from the raw storage to the Data Lake, the platform uses Spark SQL queries. The asset metadata is used -among other purposes- to automatically generate those queries.

Common columns in tables

There are some common columns that are present in several tables:

  • SecurityPath: It is a path of identifiers separated by / used for authorization. The order of the identifiers in the path follows the Metadata model hierarchy:

    {Data Storage Unit Id}/{Provider Id}/{Entity Id}

    For example, the SecurityPath 1/10/100 identifies an entity with Id 100 which belongs to a provider with Id 10 that is contained in a DSU with Id 1.

  • ParentSecurityPath: The SecurityPath of the parent element following the metadata model hierarchy.

    For example, the ParentSecurityPath of the entity of the previous example is 1/10.

Provider table

Currently, any data from any source is extracted and stored in files. The data source could be a database, an API, a set of files stored in a SFTP server... Providers are a logical collection of entities so they are used to organizing them. Usually the entities from the same data source belong to the same provider.

Column Description
Id Provider identifier
ProviderName Name of the provider
DatabaseName Name of the database created in Data Lake to store data for this provider
Description Description of the provider
Owner Identification of the person responsible of the provider
IdDataLake Identifier of the Data Storage Unit in which this provider is contained
ParentSecurityPath The SecurityPath of the DSU that contains the provider
Image Image used for the provider in the Sidra Manager UI
CreationDate Date of the creation
SecurityPath Path of identifiers used for authorization

Tag table

The tag table allows assigning tags to entities and providers. Some tags are used by the platform but it can also be created custom tags for a specific Sidra installation.

Column Description
Id Tag identifier
Name Name of the tag

EntityTag table

The entity tag table implements the association between an entity and a tag.

Column Description
IdEntity Entity identifier
IdTag Tag identifier
IdTagType Tag type identifier

ProviderTag table

The provider tag table implements the association between a provider and a tag.

Column Description
IdProvider Provider identifier
IdTag Tag identifier
IdTagType Tag type identifier

TagType table

This table describes the supported types of tags.

Column Description
Id Tag type identifier
Name Name of the tag type
Description Description of the tag type

The table contains the following static data:

Id Name Description
1 Autogen A system tag automatically created such as sensitive data
2 User provided A user added tag for a given provider

Entity table

An entity defines the common properties of a group of assets in terms of structure of the content. The entity table contains data about the format of the entity generated, as well as information about how it should be treated by the system.

Column Description
Id Entity identifier
IdProvider Identifier of the provider that generates this entity
Name Name of the entity
TableName The name of the table created in Databricks to store the information of the assets that are associated to this entity
RegularExpression Used to identify when a file will be associated to the entity. If the filename matches to the regular expression of the entity, it will be generated an asset for that file and associated that asset to the entity
StartValidDate Date from which this entity applies to the asset. It could happen that, at any time, an asset starts presenting an extra column because it has been a schema change in the data source. In order to preserve all previous data that has already been ingested in the system, there could be registered two entities that are related to the same type of asset, but generated in two different moments
EndValidDate Date until this entity applies to the file
Serde If it is necessary to use a custom serializer to ingest these entities, it must be specified here
SerdeProperties Properties of the specified serializer
Encoding Specifies the encoding used for assets related to this entity
HeaderLines Specifies the number of header lines of the assets related to this entity
FieldDelimiter The field delimiter used by the files. Can be specified as \uXXXX
IsMerged If sets to true, only new data will be inserted in the partition
LastUpdated Date in which the asset metadata has been updated and needs to be deployed to have the latest changes synchronized with the rest of the system
LastDeployed Date in which the asset metadata was deployed and synchronized with the rest of the system
Format The format of the file. Currently it is supported parquet and csv
NullText If there is a special string used as a null value, it can be specified here
ReCreateTableOnDeployment If it is needed to create or recreate the table in the Data Lake. In case the table exists, it will be overwritten. Once the table has been created, this value changes to 0
RowDelimiter The row delimiter used by the assets. Can be specified as \uXXXX
FilesPerDrop Sometimes one type of file can be generated from different sources. In this case, this parameter indicates the number of files of the entity needed to consider that the data for a day is complete
SourcePath Indicates the origin of the generated files. For example, in case of an SFTP server, this tells the path of the file inside it
Description Description of this entity
AdditionalProperties JSON structure with additional properties used for extensibility
IdTableFormat Identifier of the format of the table where this entity is stored (ORC, Parquet and Delta are supported)
GenerateDeltaTable When true, an additional table is created where the change produced by each new asset are traced
ParentSecurityPath The SecurityPath of the provider to which the entity belongs
Image Image used for the entity in the Sidra Manager UI
CreationDate Date of the creation
SecurityPath Path of identifiers used for authorization
Detail Markdown with entity details
IdDataStorageUnit Identifier of the Data Storage Unit in which the entity is contained
ViewDefinition Used when the entity is a view instead of a table in the DSU

TableFormat table

This table describes the supported formats of the table created in Databricks.

Column Description
Id TableFormat identifier
Name Name of the format

The table contains the following static data:

Id Name
1 Orc
2 Parquet
3 Delta

EntityEntity table

The EntityEntity table allows to represent many to many relationships between entities. There are several types of relations that can be established between entities, for example, when the data source is a SQL database and two entities represent two tables that are related between them in the data source, that semantic can be included in the Sidra platform using a relationship between those entities. In order to differentiate the entities involved in the relationship, one of them will be called the "Parent" entity and the other the "Child" entity.

Column Description
Id Relationship identifier
IdParentEntity Parent entity identifier
IdChildEntity Child entity identifier
Kind Kind of relationship between the entities
Description Optional description to provide details about the relationship
IdAttribute Optional. Attribute identifier that can be involved in this relationship between entities, for example, being the foreign key when Kind has the Relation value.

These are the supported values for the Kind column in the relationship between entities:

Id Name Description
0 Other The IdChildEntity with IdParentEntity but with a different kind of relationship not specified by default. Check the Description field for further information.
1 Child The IdChildEntity is generated from the IdParentEntity.
2 Delta The IdChildEntity contains Delta data from the IdParentEntity.
3 Relation For relational databases, specify that there is a relation between source tables.
4 ViewOf Entity is a view of another Entity or Entities

EntityDeltaLoad table

The EntityDeltaLoad table is used for storing information about incremental load of data from the data source.

Column Description
Id Entity delta load identifier
IdEntity Identifier of the entity associated to the delta
IdDeltaAttribute Identifier of the Attribute used for the incremental extraction if present
DeltaIsDate Flag to determine if the delta attribute is Date or not
LastDeltaValue Latest value of the incremental attribute that has been extracted
IdAuxiliaryDeltaAttribute Potential column that can be used for incremental extractions. Oftenly used when date and time are separated in two columns
LastAuxiliaryDeltaValue Latest value for the auxiliary delta value
CustomGetMaxValueExpression If the delta attribute is not present in the Entity this column contains the custom expression used to get the maximum value for incremental load
CustomExtractDataExpression If the delta attribute is not present in the Entity this column contains the custom expression used to query the source table to get the incremental data

Attribute table

Attribute contains data about the schema of the files. This information is used to identify columns in a file, create Hive tables, and, when necessary, create SQL tables in which the data is going to be extracted from Data Lake (this is for client applications). Most of the metadata about the columns is used to validate fields, since Hive does not enforce any constraint over the data (like nullables, maximum length, and so on).

Column Description
Id Attribute identifier
IdEntity The identifier of the related entity
Name Name of the column
HiveType The type of the column. It will be used to generate the Hive table
MaxLen Maximum size of the column. This applies to columns with STRING type
IsNullable Tells if the column can have null values or not
NeedTrim Tells if the column’s need to be trimmed
RemoveQuotes Tells if it is necessary to remove quotes from the column
ReplacedText Indicates a text in the column to be replaced
ReplacementText The text to replace the ReplacedText with
SpecialFormat If the column need to pass a validation that cannot be enforced using the rest of the columns, it is specified here. For example, to validate a date’s format
TreatEmptyAsNull Tells if empty values should be considered nulls
IsPrimaryKey Used when the IsMerged in entity is true to determine what data is new
Order The order of the column inside the file
IsCalculated Tells if the column is calculated, and not a column inside the file
IsPartitionColumn Tells if the column is used in the table partition or not
IsMetadata Tells if the column is metadata for the system, and not data of the file itself
SQLType Indicates the type of the column to create a SQL table
ValidationText Defines specific validations to be performed for this data instead of using the validations generated by the GenerateTransferQuery custom activity
Description Description of this attribute
SourceName Name of the column in the data source. It is used in case it is different to the name used in Sidra
ParentSecurityPath The SecurityPath of the entity to which the attribute belongs
IsEncrypted Tells if the column is encrypted
SecurityPath Path of identifiers used for authorization
IdDataStorageUnit Identifier of the Data Storage Unit in which the entity is contained
IdProvider Identifier of the provider that generates this type of file

Now, attributes from an existing Entity can be encrypted. This means that, when loading data in the Databricks table, the transfer query will encrypt all attributes whose isEncrypted column is set to true. When querying the data on that table, encrypted columns will show a random string:

databricks-attribute-encrypted

In addition, assets from an Entity can be encrypted as well, so anyone who has access to them, will not able to read the contents unless they are previously decrypted. To enable this, it is necessary to update the AdditionalProperties field from the Entity table, and adding to the JSON the following property:

1
{ "isEncrypted": "true" }

Apart from that, it is necessary to insert, in the EntityPipeline table, a relationship between the Entity whose assets are going to be encrypted and the pipeline FileIngestionWithEncryption.

Lastly, to enable any kind of encryption is necessary to generate two parameters that must be present in the deployment library, which are:

  • EncryptionKey: It is a 16, 24 or 32 bytes long string randomly generated.
  • EncryptionInitializationVector: It is a 16 bytes long string randomly generated.

AttributeFormat table

Some values of the file can contain text that are expected to be interpreted as a particular type, but when handling this data with Hive, it could not be interpreted correctly. For example, take that a column of the file is expected to be a Boolean one. Hive expects for a Boolean value to be TRUE or FALSE. However, these values can be different depending of the system that generate them. AttributeFormat handles that.

Column Description
Id Identifier of the AttributeFormat row
IdAttribute The identifier of the column to be handled
SourceValue The value that comes in the column that should be interpreted
RegExp Regular expression that, if found in the column, should be replaced
HQLExpression The value that is expected on Hive

Asset table

The Asset table stores information about each of the data elements ingested in the platform.

Column Description
Id Asset identifier
IdEntity Identifier of the entity related to this asset
AssetName Name of the asset
AssetDate Business date of the asset being ingested
LastUpdated Date of the last time that the asset was ingested in the system
IdStatus Identifier of the AssetStatus in which the asset is
Notes Additional notes about the asset
SourcePath Location of the asset from where it was ingested in the system, e.g. the location of the landing zone
Entities Number of entities in the asset
ByteSize Size in bytes
ValidationErrors Number of validation errors encountered during the asset ingestion
ParentSecurityPath The SecurityPath of the provider to which the entity belongs
SecurityPath Path of identifiers used for authorization
IsEncrypted Flag to mark that the asset is encrypted
IdDataStorageUnit Identifier of the Data Storage Unit where the asset is stored
IdProvider Identifier of the provider to which the asset belongs

AssetStatus table

This table describes the status of an asset in the system. This table is created in both Core and Client app databases, so the status set includes status for the ingestion flow into the Data Lake and for the extraction flow from the Data Lake. The last one is used in Client apps.

Column Description
Id AssetStatus identifier
StatusName Name of the status
StatusDescription Description of the status

The table contains the following static data:

Id Status name Status description
0 Error There has been an error when processing the file in some step.
1 LoadToAzureBlobStorage The file is completely loaded in Azure Blob storage.
2 MovedToDataLake The file has been loaded in Data Lake storage.
3 ImportedFromDataLake The file has been imported from the Data Lake into the client database.
4 UploadingToAzureBlobStorage The file is being transferred to Azure Blob storage.
5 RetryUploadToAzureStorage The upload has been cancelled and it has to be reuploaded.
6 ImportingFromDataLake The file is being imported from the data lake into the client database.
7 ImportingInDataLake The file is uploaded and being imported to Data Lake Store.

These are the flows for ingestion in Data Lake -on the left in the image- and for extraction from Data Lake into a client app -on the right in the image-.

file-status-flows

AssetPart table

The AssetPart table stores parts of an asset. It is used when a set of files are grouped to get one file.

Column Description
Id AssetPart identifier
IdAsset Identifier of the asset of which it is part
AssetPartName Name of the asset part
ParentSecurityPath The SecurityPath of the provider to which the entity belongs
SecurityPath Path of identifiers used for authorization
IdDataStorageUnit Identifier of the Data Storage Unit where the asset is stored
IdEntity Identifier of the entity related to this asset
IdProvider Identifier of the provider to which the asset belongs