Assets metadata

All the data that will be ingested in the Data Lake must be extracted from the data source and stored in files in the platform. Those files are the "raw storage" of the information ingested in the Data Lake and must be in one of the supported file types, currently CSV and Parquet. A specific set of file types is used so that the ingestion process from the raw storage into the optimized storage of the Data Lake can be automated. That means that if the information is stored in any other type of file in the data source (such as XML or JSON), it must first be converted to one of the supported file types.

To transfer the data from the raw storage to the Data Lake, the platform uses Spark SQL queries. The asset metadata is used (among other purposes) to automatically generate those queries.
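As a simplified illustration (the function name and the staging table convention are hypothetical, not the actual Sidra implementation), the metadata-driven generation of such a query could be sketched as:

```python
# Hypothetical sketch: building a Spark SQL ingestion query from Entity and
# Attribute metadata. Table and column names are illustrative only.

def build_transfer_query(database, table, attributes):
    """Generate the statement used to move raw-storage data into the Data Lake table."""
    columns = ", ".join(a["Name"] for a in attributes)
    return f"INSERT INTO {database}.{table} SELECT {columns} FROM staging_{table}"

query = build_transfer_query(
    "provider_db", "sales",
    [{"Name": "Id"}, {"Name": "Amount"}],
)
# query == "INSERT INTO provider_db.sales SELECT Id, Amount FROM staging_sales"
```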

Common columns in tables

There are some common columns that are present in several tables:

  • SecurityPath: It is a path of identifiers separated by / used for authorization. The order of the identifiers in the path follows the Metadata model hierarchy:

    {Data Storage Unit Id}/{Provider Id}/{Entity Id}

    For example, the SecurityPath 1/10/100 identifies an Entity with Id 100 which belongs to a provider with Id 10 that is contained in a DSU with Id 1.

  • ParentSecurityPath: The SecurityPath of the parent element following the metadata model hierarchy.

    For example, the ParentSecurityPath of the Entity of the previous example is 1/10.
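The relationship between the two columns can be sketched with a hypothetical helper (not part of the platform): the ParentSecurityPath is simply the SecurityPath with its last identifier removed.

```python
def parent_security_path(security_path: str) -> str:
    """Derive the ParentSecurityPath by dropping the last identifier of the path."""
    return security_path.rsplit("/", 1)[0]

# The Entity with SecurityPath "1/10/100" has ParentSecurityPath "1/10".
```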

Provider table

Data from any source is extracted and stored in files. The data source could be a database, an API, a set of files stored in an SFTP server, etc. Providers are a logical collection of entities and are used to organize them. Usually, the entities from the same data source belong to the same provider.

Column Description
Id Provider identifier
ItemId Globally unique identifier (GUID)
ProviderName Name of the provider
DatabaseName Name of the database created in Data Lake to store data for this provider
Description Description of the provider
Owner Identification of the person responsible for the provider
IdDataLake Identifier of the Data Storage Unit in which this provider is contained
ParentSecurityPath The SecurityPath of the DSU that contains the provider
Image Image used for the provider in the Sidra Manager UI
CreationDate Date of the creation
SecurityPath Path of identifiers used for authorization

Tag table

The Tag table allows assigning tags to entities and providers. Some tags are used by the platform, but custom tags can also be created for a specific Sidra installation.

Column Description
Id Tag identifier
Name Name of the tag

EntityTag table

The Entity tag table implements the association between an Entity and a tag.

Column Description
IdEntity Entity identifier
IdTag Tag identifier
IdTagType Tag type identifier

ProviderTag table

The provider tag table implements the association between a provider and a tag.

Column Description
IdProvider Provider identifier
IdTag Tag identifier
IdTagType Tag type identifier

TagType table

This table describes the supported types of tags.

Column Description
Id Tag type identifier
Name Name of the tag type
Description Description of the tag type

The table contains the following static data:

Id Name Description
1 Autogen A tag automatically created by the system, e.g. for sensitive data
2 User provided A user added tag for a given provider

Entity table

An Entity defines the common properties of a group of assets in terms of the structure of their content. The Entity table contains data about the format of the entity generated, as well as information about how it should be treated by the system.

Column Description
Id Entity identifier
IdProvider Identifier of the provider that generates this Entity
Name Name of the Entity
TableName The name of the table created in Databricks to store the information of the assets that are associated to this Entity
RegularExpression Used to identify when a file will be associated to the Entity. If the filename matches the regular expression of the Entity, an asset will be generated for that file and associated to the Entity
StartValidDate Date from which this Entity applies to the asset. It could happen that, at some point, an asset starts presenting an extra column because there has been a schema change in the data source. In order to preserve all previous data already ingested in the system, two entities can be registered that relate to the same type of asset but were generated at two different moments
EndValidDate Date until which this Entity applies to the asset
Serde If it is necessary to use a custom serializer to ingest these entities, it must be specified here
SerdeProperties Properties of the specified serializer
Encoding Specifies the encoding used for assets related to this Entity
HeaderLines Specifies the number of header lines of the assets related to this Entity
FieldDelimiter The field delimiter used by the files. Can be specified as \uXXXX
IsMerged If set to true, only new data will be inserted in the partition
LastUpdated Date in which the asset metadata has been updated and needs to be deployed to have the latest changes synchronized with the rest of the system
LastDeployed Date in which the asset metadata was deployed and synchronized with the rest of the system
Format The format of the file. Currently Parquet and CSV are supported
NullText If there is a special string used as a null value, it can be specified here
ReCreateTableOnDeployment Indicates whether the table in the Data Lake needs to be created or recreated. If the table already exists, it will be overwritten. Once the table has been created, this value changes to 0
RowDelimiter The row delimiter used by the assets. Can be specified as \uXXXX
FilesPerDrop Sometimes one type of file can be generated from different sources. In this case, this parameter indicates the number of files of the entity needed to consider that the data for a day is complete
SourcePath Indicates the origin of the generated files. For example, in case of an SFTP server, this tells the path of the file inside it
Description Description of this Entity
AdditionalProperties JSON structure with additional properties used for extensibility
IdTableFormat Identifier of the format of the table where this Entity is stored (ORC, Parquet and Delta are supported)
GenerateDeltaTable When true, an additional table is created where the changes produced by each new asset are traced
ParentSecurityPath The SecurityPath of the provider to which the Entity belongs
Image Image used for the Entity in the Sidra Manager UI
CreationDate Date of the creation
SecurityPath Path of identifiers used for authorization
Detail Markdown with Entity details
IdDataStorageUnit Identifier of the Data Storage Unit in which the Entity is contained
ViewDefinition Used when the Entity is a view instead of a table in the DSU
ItemId GUID of the Entity.
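As an illustration of how the RegularExpression column drives asset association (the helper function and entity names are hypothetical), a filename can be matched against the entities of a provider:

```python
import re

def match_entity(filename, entities):
    """Return the name of the first Entity whose RegularExpression matches the filename."""
    for entity in entities:
        if re.match(entity["RegularExpression"], filename):
            return entity["Name"]
    return None  # no Entity matches: no asset is generated

entities = [{"Name": "Sales", "RegularExpression": r"sales_\d{8}\.csv$"}]
match_entity("sales_20230101.csv", entities)  # "Sales"
match_entity("other.csv", entities)           # None
```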

TableFormat table

This table describes the supported formats of the table created in Databricks.

Column Description
Id TableFormat identifier
Name Name of the format

The table contains the following static data:

Id Name
1 Orc
2 Parquet
3 Delta

EntityEntity table

The EntityEntity table allows representing many-to-many relationships between entities. Several types of relations can be established between entities. For example, when the data source is a SQL database and two entities represent two tables that are related to each other in the data source, that semantic can be included in the Sidra platform using a relationship between those entities. In order to differentiate the entities involved in the relationship, one of them is called the "Parent" entity and the other the "Child" entity.

Column Description
Id Relationship identifier
IdParentEntity Parent Entity identifier
IdChildEntity Child Entity identifier
Kind Kind of relationship between the entities
Description Optional description to provide details about the relationship
IdAttribute Optional. Attribute identifier that can be involved in this relationship between entities, for example, being the foreign key when Kind has the Relation value.

These are the supported values for the Kind column in the relationship between entities:

Id Name Description
0 Other The IdChildEntity is related to the IdParentEntity with a kind of relationship not covered by the other values. Check the Description field for further information.
1 Child The IdChildEntity is generated from the IdParentEntity.
2 Delta The IdChildEntity contains Delta data from the IdParentEntity.
3 Relation For relational databases, specify that there is a relation between source tables.
4 ViewOf Entity is a view of another Entity or Entities

EntityDeltaLoad table

The EntityDeltaLoad table is used for storing information about incremental load of data from the data source.

Column Description
Id Entity delta load identifier
IdEntity Identifier of the Entity associated to the delta
IdDeltaAttribute Identifier of the Attribute used for the incremental extraction if present
DeltaIsDate Flag to determine if the delta attribute is Date or not
LastDeltaValue Latest value of the incremental attribute that has been extracted
IdAuxiliaryDeltaAttribute Identifier of an additional Attribute that can be used for incremental extractions. Often used when date and time are separated into two columns
LastAuxiliaryDeltaValue Latest value for the auxiliary delta value
CustomGetMaxValueExpression If the delta attribute is not present in the Entity, this column contains the custom expression used to get the previous maximum value for the incremental load. The CustomExtractDataExpression will be compared against this value to detect the changes
CustomExtractDataExpression If the delta attribute is not present in the Entity this column contains the custom expression (SQL SELECT) used to query the source table to get the incremental data
NeedReload If true, this field indicates some data might be lost on incremental load and the table should be reloaded from scratch.
EnableReload If true, this indicates the reload from scratch can be done in the next load process.

If IdDeltaAttribute and IdEntity are populated, then CustomGetMaxValueExpression and CustomExtractDataExpression are auto-populated as NULL, as they are not used.
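A sketch of how the delta values could be turned into an incremental-extraction predicate (a hypothetical helper; the real query generation in Sidra is more involved):

```python
def build_delta_filter(delta_attribute, last_delta_value, delta_is_date):
    """Build a WHERE clause from EntityDeltaLoad values to extract only new rows."""
    # Date values are quoted; numeric delta values (e.g. a rowversion) are not.
    value = f"'{last_delta_value}'" if delta_is_date else str(last_delta_value)
    return f"WHERE {delta_attribute} > {value}"

build_delta_filter("ModifiedDate", "2023-01-01", True)
# "WHERE ModifiedDate > '2023-01-01'"
build_delta_filter("RowVersion", 42, False)
# "WHERE RowVersion > 42"
```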

Attribute table

Attribute contains data about the schema of the files. This information is used to identify columns in a file, create Hive tables, and, when necessary, create the SQL tables into which the data is going to be extracted from the Data Lake (this is for client applications). Most of the metadata about the columns is used to validate fields, since Hive does not enforce any constraint over the data (like nullables, maximum length, and so on).

Column Description
Id Attribute identifier
IdEntity The identifier of the related Entity
Name Name of the column
HiveType The type of the column. It will be used to generate the Hive table
MaxLen Maximum size of the column. This applies to columns with STRING type
IsNullable Tells if the column can have null values or not
NeedTrim Tells if the column values need to be trimmed
RemoveQuotes Tells if it is necessary to remove quotes from the column
ReplacedText Indicates a text in the column to be replaced
ReplacementText The text to replace the ReplacedText with
SpecialFormat If the column needs to pass a validation that cannot be enforced using the rest of the columns, it is specified here. For example, to validate a date format
TreatEmptyAsNull Tells if empty values should be considered nulls
IsPrimaryKey Used when the IsMerged in Entity is true to determine what data is new
Order The order of the column inside the file
IsCalculated Tells if the column is calculated, and not a column inside the file
IsPartitionColumn Tells if the column is used in the table partition or not
IsMetadata Tells if the column is metadata for the system, and not data of the file itself
SQLType Indicates the type of the column to create a SQL table
ValidationText Defines specific validations to be performed for this data instead of using the validations generated by the GenerateTransferQuery custom activity
Description Description of this attribute
SourceName Name of the column in the data source. It is used in case it is different from the name used in Sidra
ParentSecurityPath The SecurityPath of the Entity to which the attribute belongs
IsEncrypted Tells if the column is encrypted
SecurityPath Path of identifiers used for authorization
IdDataStorageUnit Identifier of the Data Storage Unit in which the Entity is contained
IdProvider Identifier of the provider that generates this type of file
DataMask [Optional] This field indicates that Data masking rules will be applied in the preview of this field in the Data Catalog.
ItemId GUID of the Attribute.
SourceType Represents the type that the Attribute has in the source system (DB type in the case of databases)

Attributes from an existing Entity can be encrypted. This means that, when loading data in the Databricks table, the transfer query will encrypt all attributes whose IsEncrypted column is set to true. When querying the data on that table, encrypted columns will show a random string:

databricks-attribute-encrypted

In addition, assets from an Entity can be encrypted as well, so anyone who has access to them will not be able to read the contents unless they are previously decrypted. To enable this, it is necessary to update the AdditionalProperties field of the Entity table, adding the following property to the JSON:

{ "isEncrypted": "true" }

Apart from that, it is necessary to insert, in the EntityPipeline table, a relationship between the Entity whose assets are going to be encrypted and the FileIngestionWithEncryption pipeline.

Lastly, to enable any kind of encryption, it is necessary to generate two parameters that must be present in the deployment library:

  • EncryptionKey: A randomly generated string 16, 24 or 32 bytes long.
  • EncryptionInitializationVector: A randomly generated string 16 bytes long.
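A minimal sketch of how such random strings could be generated (assuming plain alphanumeric characters, where one character equals one byte; the helper name is hypothetical):

```python
import secrets
import string

def random_token(length: int) -> str:
    """Generate a cryptographically random alphanumeric string of the given length."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

encryption_key = random_token(32)                    # 16, 24 or 32 characters
encryption_initialization_vector = random_token(16)  # exactly 16 characters
```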

AttributeFormat table

Some file values contain text that is expected to be interpreted as a particular type, but Hive may not interpret it correctly. For example, suppose a column of the file is expected to be Boolean. Hive expects a Boolean value to be TRUE or FALSE; however, these values can differ depending on the system that generates them. AttributeFormat handles that.

Column Description
Id Identifier of the AttributeFormat row
IdAttribute The identifier of the column to be handled
SourceValue The value that comes in the column that should be interpreted
RegExp Regular expression that, if found in the column, should be replaced
HQLExpression The value that is expected on Hive
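A sketch of the kind of mapping AttributeFormat enables, using hypothetical Boolean source values (the helper and the data are illustrative, not the platform's implementation):

```python
def normalize_value(source_value, formats):
    """Map a source value to the HQL expression defined in AttributeFormat rows."""
    for fmt in formats:
        if source_value == fmt["SourceValue"]:
            return fmt["HQLExpression"]
    return source_value  # no rule: value passes through unchanged

formats = [
    {"SourceValue": "Y", "HQLExpression": "TRUE"},
    {"SourceValue": "N", "HQLExpression": "FALSE"},
]
normalize_value("Y", formats)  # "TRUE"
normalize_value("X", formats)  # "X"
```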

Asset table

The Asset table stores information about each of the data elements ingested in the platform.

Column Description
Id Asset identifier
IdEntity Identifier of the Entity related to this asset
AssetName Name of the asset
AssetDate Business date of the asset being ingested
LastUpdated Date of the last time that the asset was ingested in the system
IdStatus Identifier of the AssetStatus in which the asset is
Notes Additional notes about the asset
SourcePath Location of the asset from where it was ingested in the system, e.g. the location of the landing zone
Entities Number of entities in the asset
ByteSize Size in bytes
ValidationErrors Number of validation errors encountered during the asset ingestion
ParentSecurityPath The SecurityPath of the provider to which the Entity belongs
SecurityPath Path of identifiers used for authorization
IsEncrypted Flag to mark that the asset is encrypted
IdDataStorageUnit Identifier of the Data Storage Unit where the asset is stored
IdProvider Identifier of the provider to which the asset belongs
DestinationPath The path of the file at destination.

AssetStatus table

This table describes the status of an asset in the system. This table is created in both Core and Client app databases, so the status set includes statuses for the ingestion flow into the Data Lake and for the extraction flow from the Data Lake. The latter is used in Client apps.

Column Description
Id AssetStatus identifier
StatusName Name of the status
StatusDescription Description of the status

The table contains the following static data:

Id Status name Status description
0 Error There has been an error when processing the file in some step.
1 LoadToAzureBlobStorage The file is completely loaded in Azure Blob storage.
2 MovedToDataLake The file has been loaded in Data Lake storage.
3 ImportedFromDataLake The file has been imported from the Data Lake into the client database.
4 UploadingToAzureBlobStorage The file is being transferred to Azure Blob storage.
5 RetryUploadToAzureStorage The upload has been cancelled and it has to be reuploaded.
6 ImportingFromDataLake The file is being imported from the data lake into the client database.
7 ImportingInDataLake The file is uploaded and being imported to Data Lake Store.

These are the flows for ingestion into the Data Lake (on the left in the image) and for extraction from the Data Lake into a client app (on the right).

file-status-flows

AssetPart table

The AssetPart table stores parts of an asset. It is used when a set of files are grouped to form a single asset.

Column Description
Id AssetPart identifier
IdAsset Identifier of the asset of which it is part
AssetPartName Name of the asset part
ParentSecurityPath The SecurityPath of the provider to which the Entity belongs
SecurityPath Path of identifiers used for authorization
IdDataStorageUnit Identifier of the Data Storage Unit where the asset is stored
IdEntity Identifier of the Entity related to this asset
IdProvider Identifier of the provider to which the asset belongs

TypeTranslation table

The TypeTranslation table stores type mapping and transformation rules for the different stages of Attribute processing when data sources are configured and ingested in Sidra.

Column Description
Id TypeTranslation identifier
SourceType Data type in the source system (e.g. ntext, float, VARBINARY)
Translation Destination type or transformation expression to destination type (e.g. NVARCHAR(MAX))
ConnectorID GUID for the plugin used for the data intake process
ConnectorVersion Version of the plugin used for the data intake process
IdTypeTranslationSink Id of the Type Translation sink or destination
StoredAsBase64 Boolean indicating whether the type is stored as base64

See the section about data ingestion for more information about the usage of this table.

The TypeTranslationSink table includes different values:

Id Name
1 Parquet
2 Databricks
3 TSQL
4 Source

PipelineLoadRestrictionMode

The PipelineLoadRestrictionMode, PipelineLoadRestriction and PipelineLoadRestrictionObject tables are required for storing the object (e.g. tables, views) inclusions or exclusions to be considered when calling the metadata extraction pipeline for certain data sources, e.g. databases.

Section Configure new data source describes the general conceptual steps about the configuration of a new data source in Sidra. One of the required steps is to configure and create the metadata structures about the data source (Provider, Entities and Attributes).

In the case of databases, Sidra usually incorporates templates for metadata extraction pipelines, which are able to read the schema of the source databases and create the Entity and Attribute metadata in Sidra Core metadata tables from the information obtained from that schema. The metadata extraction pipeline also includes as a parameter a list of objects to include or exclude.

These sets of objects to include or exclude are called LoadRestrictionObjects in Sidra. There are two load restriction rule types: inclusion and exclusion. These types are stored in the PipelineLoadRestrictionMode reference table:

Id Name
1 Include
2 Exclude

With inclusion mode, the list of load restriction objects is applied with an inclusion policy (only the objects in the PipelineLoadRestrictionObject table are loaded); with exclusion mode, all objects are loaded except those in the PipelineLoadRestrictionObject table.
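A sketch of the two policies (a hypothetical helper; the mode names are taken from the reference table above):

```python
def apply_load_restriction(objects, restriction_objects, mode):
    """Filter source objects according to the PipelineLoadRestrictionMode."""
    restricted = set(restriction_objects)
    if mode == "Include":
        # Inclusion policy: only the listed objects are loaded.
        return [o for o in objects if o in restricted]
    # Exclusion policy: everything is loaded except the listed objects.
    return [o for o in objects if o not in restricted]

tables = ["dbo.Sales", "dbo.Staging", "dbo.Audit"]
apply_load_restriction(tables, ["dbo.Staging"], "Exclude")
# ["dbo.Sales", "dbo.Audit"]
apply_load_restriction(tables, ["dbo.Sales"], "Include")
# ["dbo.Sales"]
```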

PipelineLoadRestrictionObject

The PipelineLoadRestrictionObject table includes the actual list of objects belonging to each LoadRestriction rule set.

Column Description
Id Id of the load restriction object
IdPipelineLoadRestriction Id of the LoadRestriction rule set configured in the platform
ObjectName A string with the name of the object that is part of the LoadRestriction rule set

The format for the ObjectName includes the name of the [table], or a concatenation of [database_name].[schema_name].[table_name], e.g. [TestDB0], [TestDB4].[dbo].[TestTable0].
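A hypothetical helper showing how such a bracketed ObjectName could be split into its parts:

```python
def parse_object_name(object_name: str):
    """Split a bracketed ObjectName into its (database, schema, table) components."""
    parts = object_name.split(".")
    return tuple(p.strip("[]") for p in parts)

parse_object_name("[TestDB4].[dbo].[TestTable0]")  # ("TestDB4", "dbo", "TestTable0")
parse_object_name("[TestDB0]")                     # ("TestDB0",)
```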

PipelineLoadRestriction

The PipelineLoadRestriction table is a relationship table that relates a LoadRestriction rule set and a LoadRestrictionMode.

Column Description
Id Id of the load restriction rule configured in the platform. This Id is the same as the Id of the pipeline used to load the data.
IdLoadRestrictionMode Id of the LoadRestrictionMode

The information contained in the above three tables will be populated at new data source configuration time. These tables will then be consulted during the execution of the metadata extraction pipelines. See the documentation on data ingestion for more information on the usage of these tables.