Assets metadata¶
All the data ingested in the Data Lake must be extracted from the data source and stored in files in the platform. Those files are the "raw storage" of the information ingested in the Data Lake and must be in one of the supported file types, currently CSV and Parquet. Using specific file types makes it possible to automate the ingestion process from the raw storage into the optimized storage of the Data Lake. That means that if the information is stored in any other type of file in the data source -such as XML or JSON- it must first be converted to one of the supported file types.
To transfer the data from the raw storage to the Data Lake, the platform uses Spark SQL queries. The Asset metadata is used -among other purposes- to automatically generate those queries.
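As an illustration only -the real query is generated by the platform from the metadata described below- a transfer query has roughly this shape; all table and column names here are hypothetical:

```sql
-- Hypothetical sketch of a generated transfer query (Spark SQL):
-- it reads the raw CSV staging data and inserts it, typed and cleaned,
-- into the optimized Data Lake table of the Entity.
INSERT INTO salescrm.orders PARTITION (assetdate)
SELECT
    CAST(s.order_id AS INT)  AS order_id,   -- type taken from the Attribute metadata
    TRIM(s.customer)         AS customer,   -- trimming driven by Attribute.NeedTrim
    '2024-01-01'             AS assetdate   -- partition/metadata column
FROM staging_orders s;
```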
Common columns in tables¶
There are some common columns that are present in several tables:
- `SecurityPath`: a path of identifiers separated by `/` used for authorization. The order of the identifiers in the path follows the Metadata model hierarchy. For example, the SecurityPath `1/10/100` identifies an Entity with Id `100`, which belongs to a Provider with Id `10`, which is contained in a DSU with Id `1`.
- `ParentSecurityPath`: the `SecurityPath` of the parent element following the metadata model hierarchy. For example, the `ParentSecurityPath` of the Entity in the previous example is `1/10`.
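As an illustration, a hypothetical query using these paths for authorization checks (the actual enforcement logic is implemented by the platform):

```sql
-- List every Entity visible to a user granted access to Provider 10 in DSU 1:
-- all their SecurityPath values start with the Provider's path '1/10/'.
SELECT e.Id, e.Name
FROM Entity e
WHERE e.SecurityPath LIKE '1/10/%';
```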
Provider table¶
All data, regardless of its source, is extracted and stored in files. The data source could be a database, an API, a set of files stored in an SFTP server... Providers are a logical collection of Entities and are used to organize them. Usually, the Entities from the same data source belong to the same Provider.
Column | Description |
---|---|
Id | Provider identifier |
ItemId | Global Unique identifier (GUID) |
ProviderName | Name of the Provider. Space character is not allowed. |
DatabaseName | Name of the database created in Data Lake to store data for this Provider |
Description | Description of the Provider |
Owner | Identification of the person responsible for the Provider |
IdDataLake | Identifier of the Data Storage Unit in which this Provider is contained |
ParentSecurityPath | The SecurityPath of the DSU that contains the Provider |
Image | Image used for the Provider in the Sidra Manager UI |
CreationDate | Date of the creation |
SecurityPath | Path of identifiers used for authorization |
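As a hedged sketch with illustrative values (in a real installation, Providers are usually created through the Sidra API, which also fills in identifiers and security paths):

```sql
-- Hypothetical Provider row; the Id is assumed to be 10 for the SecurityPath.
INSERT INTO Provider
    (ItemId, ProviderName, DatabaseName, Description, Owner,
     IdDataLake, ParentSecurityPath, SecurityPath, CreationDate)
VALUES
    (NEWID(), 'SalesCRM', 'salescrm', 'CRM sales data', 'data.owner@example.com',
     1, '1', '1/10', GETUTCDATE());
```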
Tag table¶
The Tag table allows assigning tags to Entities and Providers. Some tags are used by the platform, but custom tags can also be created for a specific Sidra installation.
Column | Description |
---|---|
Id | Tag identifier |
Name | Name of the tag |
EntityTag table¶
The Entity tag table implements the association between an Entity and a tag.
Column | Description |
---|---|
IdEntity | Entity identifier |
IdTag | Tag identifier |
IdTagType | Tag type identifier |
ProviderTag table¶
The Provider tag table implements the association between a Provider and a tag.
Column | Description |
---|---|
IdProvider | Provider identifier |
IdTag | Tag identifier |
IdTagType | Tag type identifier |
TagType table¶
This table describes the supported types of tags.
Column | Description |
---|---|
Id | Tag type identifier |
Name | Name of the tag type |
Description | Description of the tag type |
The table contains the following static data:
Id | Name | Description |
---|---|---|
1 | Autogen | A system tag created automatically by the platform, e.g. for sensitive data |
2 | User provided | A tag added by a user for a given Provider |
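For instance, a sketch of creating a custom tag and assigning it to a Provider (hypothetical identifiers; TagType 2 is the static "User provided" value above):

```sql
-- Create a custom tag and attach it to Provider 10 as a user-provided tag.
INSERT INTO Tag (Name) VALUES ('finance');

INSERT INTO ProviderTag (IdProvider, IdTag, IdTagType)
VALUES (10, SCOPE_IDENTITY(), 2);   -- SCOPE_IDENTITY(): Id of the Tag just created
```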
Entity table¶
Information
Take into account that Sidra metadata tables do not allow space characters in the names of Attributes or Entities. From the version 2022.R2 (1.12.2) update, the metadata API sanitizes space characters if they are provided in Entity or Attribute names, replacing them with the `_` character.
An Entity defines the common properties of a group of Assets in terms of structure of the content. The Entity table contains data about the format of the Entity generated, as well as information about how it should be treated by the system.
Column | Description |
---|---|
Id | Entity identifier |
IdProvider | Identifier of the Provider that generates this Entity |
Name | Name of the Entity. Space character is not allowed. |
TableName | The name of the table created in Databricks to store the information of the Assets that are associated to this Entity |
RegularExpression | Used to identify when a file will be associated to the Entity. If the filename matches the regular expression of the Entity, an Asset is generated for that file and associated to the Entity |
StartValidDate | Date from which this Entity applies to the Asset. It could happen that, at some point, an Asset starts presenting an extra column because there has been a schema change in the data source. In order to preserve all the previous data already ingested in the system, two Entities related to the same type of Asset, but generated at two different moments, can be registered |
EndValidDate | Date until this Entity applies to the file |
Serde | If it is necessary to use a custom serializer to ingest these Entities, it must be specified here |
SerdeProperties | Properties of the specified serializer |
Encoding | Specifies the encoding used for Assets related to this Entity |
HeaderLines | Specifies the number of header lines of the Assets related to this Entity |
FieldDelimiter | The field delimiter used by the files. Can be specified as \uXXXX |
IsMerged | If set to true, only new data will be inserted in the partition |
LastUpdated | Date in which the Asset metadata has been updated and needs to be deployed to have the latest changes synchronized with the rest of the system |
LastDeployed | Date in which the Asset metadata was deployed and synchronized with the rest of the system |
Format | The format of the file. Currently, Parquet and CSV are supported |
NullText | If there is a special string used as a null value, it can be specified here |
ReCreateTableOnDeployment | Indicates if the table in the Data Lake needs to be created or recreated. If the table exists, it will be overwritten. Once the table has been created, this value changes to 0 |
RowDelimiter | The row delimiter used by the Assets. Can be specified as \uXXXX |
FilesPerDrop | Sometimes one type of file can be generated from different sources. In this case, this parameter indicates the number of files of the Entity needed to consider that the data for a day is complete |
SourcePath | Indicates the origin of the generated files. For example, in the case of an SFTP server, this is the path of the file within the server |
Description | Description of this Entity |
AdditionalProperties | JSON structure with additional properties used for extensibility |
IdTableFormat | Identifier of the format of the table where this Entity is stored (ORC, Parquet and Delta are supported) |
GenerateDeltaTable | When true, an additional table is created in which the changes produced by each new Asset are traced |
ParentSecurityPath | The SecurityPath of the Provider to which the Entity belongs |
Image | Image used for the Entity in the Sidra Manager UI |
CreationDate | Date of the creation |
SecurityPath | Path of identifiers used for authorization |
Detail | Markdown with Entity details |
IdDataStorageUnit | Identifier of the Data Storage Unit in which the Entity is contained |
ViewDefinition | Used when the Entity is a view instead of a table in the DSU |
ItemId | GUID of the Entity. |
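A hedged sketch of an Entity row (hypothetical values), showing in particular how `RegularExpression` links landed files to the Entity:

```sql
-- Hypothetical Entity: a file named orders_20240131.csv matches the regular
-- expression, so an Asset is created for it and associated to this Entity.
INSERT INTO Entity
    (IdProvider, Name, TableName, RegularExpression, StartValidDate,
     Encoding, HeaderLines, FieldDelimiter, Format, IdTableFormat,
     ParentSecurityPath, SecurityPath, CreationDate)
VALUES
    (10, 'Orders', 'orders', '^orders_\d{8}\.csv$', '2024-01-01',
     'UTF-8', 1, ',', 'csv', 3, '1/10', '1/10/100', GETUTCDATE());
```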
TableFormat table¶
This table describes the supported formats of the table created in Databricks.
Column | Description |
---|---|
Id | TableFormat identifier |
Name | Name of the format |
The table contains the following static data:
Id | Name |
---|---|
1 | Orc |
2 | Parquet |
3 | Delta |
EntityEntity table¶
The `EntityEntity` table allows representing many-to-many relationships between Entities. Several types of relations can be established between Entities. For example, when the data source is a SQL database and two Entities represent two tables that are related in the data source, that semantic can be captured in the Sidra platform using a relationship between those Entities. To differentiate the Entities involved in the relationship, one of them is called the "Parent" Entity and the other the "Child" Entity.
Column | Description |
---|---|
Id | Relationship identifier |
IdParentEntity | Parent Entity identifier |
IdChildEntity | Child Entity identifier |
Kind | Kind of relationship between the Entities |
Description | Optional description to provide details about the relationship |
IdAttribute | Optional. Attribute identifier that can be involved in this relationship between Entities, for example, being the foreign key when Kind has the Relation value. |
These are the supported values for the `Kind` column in the relationship between Entities:
Id | Name | Description |
---|---|---|
0 | Other | The IdChildEntity is related to the IdParentEntity with a different kind of relationship, not covered by the other values. Check the Description field for further information. |
1 | Child | The IdChildEntity is generated from the IdParentEntity. |
2 | Delta | The IdChildEntity contains Delta data from the IdParentEntity. |
3 | Relation | For relational databases, specify that there is a relation between source tables. |
4 | ViewOf | Entity is a view of another Entity or Entities |
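For example, a sketch of recording a foreign-key relation between two Entities extracted from the same SQL database (hypothetical identifiers):

```sql
-- Orders (Entity 101) references Customers (Entity 100) through the
-- Attribute with Id 1001 (the foreign-key column); Kind 3 = Relation.
INSERT INTO EntityEntity (IdParentEntity, IdChildEntity, Kind, Description, IdAttribute)
VALUES (100, 101, 3, 'Orders.CustomerId references Customers.Id', 1001);
```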
EntityDeltaLoad table¶
The `EntityDeltaLoad` table is used for storing information about the incremental load of data from the data source.
Column | Description |
---|---|
Id | Entity delta load identifier |
IdEntity | Identifier of the Entity associated to the delta |
IdDeltaAttribute | Identifier of the Attribute used for the incremental extraction if present |
DeltaIsDate | Flag to determine if the delta Attribute is Date or not |
LastDeltaValue | Latest value of the incremental Attribute that has been extracted |
IdAuxiliaryDeltaAttribute | Potential column that can be used for incremental extractions. Often used when date and time are separated in two columns |
LastAuxiliaryDeltaValue | Latest value for the auxiliary delta value |
CustomGetMaxValueExpression | If the delta Attribute is not present in the Entity, this column contains the custom expression used to get the maximum value reached by the previous incremental load. The CustomExtractDataExpression will be compared against this value to detect the changes |
CustomExtractDataExpression | If the delta Attribute is not present in the Entity, this column contains the custom expression (a SQL SELECT) used to query the source table to get the incremental data |
NeedReload | If true, this field indicates that some data might be lost on incremental load and the table should be reloaded from scratch. |
EnableReload | If true, this indicates that the reload from scratch can be done in the next load process. |
If `IdDeltaAttribute` and `IdEntity` are populated, then `CustomGetMaxValueExpression` and `CustomExtractDataExpression` are set to NULL, as they are not used.
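A hedged sketch of the custom-expression variant (hypothetical source table and placeholder syntax; check your Sidra version for the exact token used to inject the last delta value):

```sql
-- When no delta Attribute exists on the Entity, custom expressions drive the
-- incremental load: one computes the current high-water mark, the other
-- extracts the rows changed since the previous one.
UPDATE EntityDeltaLoad
SET IdDeltaAttribute = NULL,
    CustomGetMaxValueExpression = 'SELECT MAX(ModifiedDate) FROM dbo.Orders',
    CustomExtractDataExpression =
        'SELECT * FROM dbo.Orders WHERE ModifiedDate > ''<lastDeltaValue>'''
WHERE IdEntity = 101;
```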
Attribute table¶
Attribute contains data about the schema of the files. This information is used to identify the columns in a file, create Hive tables and, when necessary, create the SQL tables into which the data is extracted from the Data Lake (this is for Client applications). Most of the metadata about the columns is used to validate fields, since Hive does not enforce any constraints over the data (like nullability, maximum length, and so on).
Information
Take into account that Sidra metadata tables do not allow space characters in the names of Attributes or Entities. From the version 2022.R2 (1.12.2) update, the metadata API sanitizes space characters if they are provided in Entity or Attribute names, replacing them with the `_` character.
Column | Description |
---|---|
Id | Attribute identifier |
IdEntity | The identifier of the related Entity |
Name | Name of the column. Space character is not allowed. |
HiveType | The type of the column. It will be used to generate the Hive table |
MaxLen | Maximum size of the column. This applies to columns with STRING type |
IsNullable | Tells if the column can have null values or not |
NeedTrim | Tells if the column values need to be trimmed |
RemoveQuotes | Tells if it is necessary to remove quotes from the column |
ReplacedText | Indicates a text in the column to be replaced |
ReplacementText | The text to replace the ReplacedText with |
SpecialFormat | If the column needs to pass a validation that cannot be enforced using the rest of the columns, it is specified here. For example, to validate a date format |
TreatEmptyAsNull | Tells if empty values should be considered nulls |
IsPrimaryKey | Used when the IsMerged in Entity is true to determine what data is new |
Order | The order of the column inside the file |
IsCalculated | Tells if the column is calculated, and not a column inside the file |
IsPartitionColumn | Tells if the column is used in the table partition or not |
IsMetadata | Tells if the column is metadata for the system, and not data of the file itself |
SQLType | Indicates the type of the column to create a SQL table |
ValidationText | Defines specific validations to be performed for this data instead of using the validations generated by the GenerateTransferQuery custom activity |
Description | Description of this Attribute |
SourceName | Name of the column in the data source. It is used when it differs from the name used in Sidra |
ParentSecurityPath | The SecurityPath of the Entity to which the Attribute belongs |
IsEncrypted | Tells if the column is encrypted |
SecurityPath | Path of identifiers used for authorization |
IdDataStorageUnit | Identifier of the Data Storage Unit in which the Entity is contained |
IdProvider | Identifier of the Provider that generates this type of file |
DataMask | [Optional] This field indicates the Data masking rule that will be applied in the preview of this field in the Data Catalog. |
ItemId | GUID of the Attribute. |
SourceType | Represents the type that the Attribute has in the source system (DB type in the case of databases) |
Note: DataMask relies on the Microsoft SQL Server implementation. For the DataMask functions, please refer to the list of all SQL Server masking functions in the Microsoft documentation. An example of such a DataMask value could be `partial(1,"XXXXXXX",0)` (illustrative), which would mean that a custom data masking function is applied where only the first character of the string is left unmasked.
Now, Attributes from an existing Entity can be encrypted. This means that, when loading data into the Databricks table, the transfer query will encrypt all Attributes whose `isEncrypted` column is set to true. When querying the data in that table, the encrypted columns will show a random string instead of the original values.
In addition, the Assets from an Entity can be encrypted as well, so anyone who has access to them will not be able to read their contents unless they are previously decrypted. To enable this, it is necessary to update the `AdditionalProperties` field of the Entity table, adding the corresponding property to the JSON.
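A hedged sketch of that update; the property name `isEncrypted` is an assumption here, so check the exact property name for your Sidra version:

```sql
-- Assumption: the encryption flag property is called "isEncrypted".
UPDATE Entity
SET AdditionalProperties =
        JSON_MODIFY(ISNULL(AdditionalProperties, '{}'), '$.isEncrypted', 'true')
WHERE Id = 101;   -- hypothetical Entity Id
```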
Apart from that, it is necessary to insert, in the `EntityPipeline` table, a relationship between the Entity whose Assets are going to be encrypted and the `FileIngestionWithEncryption` pipeline.
Lastly, to enable any kind of encryption, it is necessary to generate two parameters that must be present in the deployment library:

- EncryptionKey: a randomly generated string 16, 24 or 32 bytes long.
- EncryptionInitializationVector: a randomly generated string 16 bytes long.
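For example, one possible way to generate such values (a sketch using SQL Server's `CRYPT_GEN_RANDOM`; any cryptographically secure random generator is equally valid):

```sql
-- 16 random bytes rendered as 32 hex characters -> a 32-byte string key.
SELECT CONVERT(VARCHAR(32), CRYPT_GEN_RANDOM(16), 2) AS EncryptionKey;

-- 8 random bytes rendered as 16 hex characters -> a 16-byte string IV.
SELECT CONVERT(VARCHAR(16), CRYPT_GEN_RANDOM(8), 2) AS EncryptionInitializationVector;
```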
AttributeFormat table¶
Some values in a file contain text that is expected to be interpreted as a particular type, but Hive may not interpret that data correctly. For example, suppose that a column of the file is expected to be Boolean. Hive expects a Boolean value to be TRUE or FALSE, but these values can differ depending on the system that generated them. AttributeFormat handles such cases.
Column | Description |
---|---|
Id | Identifier of the AttributeFormat row |
IdAttribute | The identifier of the column to be handled |
SourceValue | The value that comes in the column that should be interpreted. This field represents the source value to transform. MaximumLength(50) |
RegExp | Regular expression that, if found in the column, should be replaced. The expression to transform. It is applied only if SourceValue is null. This should be a valid RegEx expression. MaximumLength(100). |
HQLExpression | The value that is expected on Hive. The Hive expression used to replace values from SourceValue/RegExp to the one defined in this expression. This should be a valid HQL expression. NotEmpty, MaximumLength(512). |
LookupExpression | [Optional] Expression used in the lookup. It is only supported to have one single -as max- AttributeFormat with LookupExpression defined per Attribute. |
ParentSecurityPath | [Optional] The security path of the parent following the metadata model hierarchy |
ItemId | ItemId for the AttributeFormat record |
SecurityPath | [Optional] The security path of the Attribute following the metadata model hierarchy |
IdDataStorageUnit | Id of the Data Storage Unit |
IdEntity | Id of the parent Entity |
IdProvider | Id of the parent Provider |
Considerations:

- The `RegExp` is similar to `SourceValue`, but instead of defining the specific value in the column that must be formatted, it defines a regular expression. When the regular expression matches a value in the column, it means that the value must be formatted. If a value is defined in `SourceValue`, the `RegExp` will be ignored.
- The `LookupExpression` field is an optional expression to use lookups with the SourceName in the Attribute. The format is the following: `[tableName];[columnToUseInJoinComparison];[columnToUse]`. For example, suppose that the SourceName is `userid` and the Name is `Name` for a given Attribute, and that there is a `Users` table, containing `Id` and `Username` columns, from which the username should be retrieved. To map the user id to the username, the expression will be `users;id;username`, and the result in the transfer query will be similar to: `SELECT ... t1.username AS Name ... FROM staging s JOIN dw_database.users t1 ON s.userid = t1.id`
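For example, a sketch of the Boolean case described above (hypothetical Attribute Id), mapping source values to the literals Hive expects:

```sql
-- A source system emits 'Y'/'N' for a Boolean column; AttributeFormat rows
-- tell the transfer query to replace them with TRUE/FALSE.
INSERT INTO AttributeFormat (IdAttribute, SourceValue, HQLExpression)
VALUES (1001, 'Y', 'TRUE'),
       (1001, 'N', 'FALSE');
```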
Asset table¶
The `Asset` table stores information about each of the data elements ingested in the platform.
Column | Description |
---|---|
Id | Asset identifier |
IdEntity | Identifier of the Entity related to this Asset |
AssetName | Name of the Asset |
AssetDate | Business date of the Asset being ingested |
LastUpdated | Date of the last time that the Asset was ingested in the system |
IdStatus | Identifier of the AssetStatus in which the Asset is |
Notes | Additional notes about the Asset |
SourcePath | Location of the Asset from where it was ingested in the system, e.g. the location of the landing zone |
Entities | Number of Entities in the Asset |
ByteSize | Size in bytes |
ValidationErrors | Number of validation errors encountered during the Asset ingestion |
ParentSecurityPath | The SecurityPath of the Provider to which the Entity belongs |
SecurityPath | Path of identifiers used for authorization |
IsEncrypted | Flag to mark that the Asset is encrypted |
IdDataStorageUnit | Identifier of the Data Storage Unit where the Asset is stored |
IdProvider | Identifier of the Provider to which the Asset belongs |
DestinationPath | The path of the file at destination. |
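For instance, a hypothetical monitoring query over this table:

```sql
-- Find Assets that failed or reported validation errors during ingestion.
SELECT a.Id, a.AssetName, a.AssetDate, a.ValidationErrors
FROM Asset a
WHERE a.IdStatus = 0            -- 0 = Error (see the AssetStatus table below)
   OR a.ValidationErrors > 0;
```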
AssetStatus table¶
This table describes the status of an Asset in the system. This table is created in both Core and Client app databases, so the status set includes statuses for the ingestion flow into the Data Lake and for the extraction flow from the Data Lake. The latter is used in Client apps.
Column | Description |
---|---|
Id | AssetStatus identifier |
StatusName | Name of the status |
StatusDescription | Description of the status |
The table contains the following static data:
Id | Status name | Status description |
---|---|---|
0 | Error | There has been an error when processing the file in some step. |
1 | LoadToAzureBlobStorage | The file is completely loaded in Azure Blob storage. |
2 | MovedToDataLake | The file has been loaded in Data Lake storage. |
3 | ImportedFromDataLake | The file has been imported from the Data Lake into the client database. |
4 | UploadingToAzureBlobStorage | The file is being transferred to Azure Blob storage. |
5 | RetryUploadToAzureStorage | The upload has been cancelled and it has to be reuploaded. |
6 | ImportingFromDataLake | The file is being imported from the data lake into the client database. |
7 | ImportingInDataLake | The file is uploaded and being imported to Data Lake Store. |
These status values cover both the flow for ingestion into the Data Lake and the flow for extraction from the Data Lake into a Client app.
AssetPart table¶
The `AssetPart` table stores the parts of an Asset. It is used when a set of files is grouped to form a single file.
Column | Description |
---|---|
Id | AssetPart identifier |
IdAsset | Identifier of the Asset of which it is part |
AssetPartName | Name of the Asset part |
ParentSecurityPath | The SecurityPath of the Provider to which the Entity belongs |
SecurityPath | Path of identifiers used for authorization |
IdDataStorageUnit | Identifier of the Data Storage Unit where the Asset is stored |
IdEntity | Identifier of the Entity related to this Asset |
IdProvider | Identifier of the Provider to which the Asset belongs |
TypeTranslation table¶
The `TypeTranslation` table stores the type mappings and transformation rules for the different stages of Attribute processing when data sources are configured and ingested in Sidra.
Column | Description |
---|---|
Id | TypeTranslation identifier |
SourceType | Data type in the source system (e.g. ntext, float, VARBINARY) |
Translation | Destination type or transformation expression to the destination type (e.g. NVARCHAR(MAX)) |
ConnectorID | GUID for the plugin used for the data intake process |
ConnectorVersion | Version of the plugin used for the data intake process |
IdTypeTranslationSink | Id of the Type Translation sink or destination |
StoredAsBase64 | Boolean indicating whether the type is stored as base64 |
See the section about Data ingestion for more information about the usage of this table.
The `TypeTranslationSink` table includes the following values:

Id | Name |
---|---|
1 | Parquet |
2 | Databricks |
3 | TSQL |
4 | Source |
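For example, a sketch of resolving how a given source type is translated for each sink:

```sql
-- How is the SQL Server type 'ntext' materialized in each destination?
SELECT tt.SourceType, tt.Translation, tts.Name AS Sink
FROM TypeTranslation tt
JOIN TypeTranslationSink tts ON tts.Id = tt.IdTypeTranslationSink
WHERE tt.SourceType = 'ntext';
```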
PipelineLoadRestrictionMode¶
The `PipelineLoadRestrictionMode`, `PipelineLoadRestriction` and `PipelineLoadRestrictionObject` tables are required for storing the object (e.g. tables, views, etc.) inclusions or exclusions to be considered when calling the metadata extraction pipeline for certain data sources, e.g. databases.
Section Configure new data source describes the general conceptual steps about the configuration of a new data source in Sidra. One of the required steps is to configure and create the metadata structures about the data source (Provider, Entities and Attributes).
In the case of databases, Sidra usually incorporates templates for metadata extraction pipelines, which are able to read the schema of the source databases and create the Entity and Attribute metadata in the Sidra Core metadata tables from the information obtained from that schema. The metadata extraction pipeline also includes as a parameter a list of objects to include or exclude.
This set of objects to include or exclude is called LoadRestrictionObjects in Sidra.
There are two types of Load restriction rules, inclusion and exclusion, stored in the `PipelineLoadRestrictionMode` reference table:
Id | Name |
---|---|
1 | Include |
2 | Exclude |
When the Include mode is used, only the objects listed in the PipelineLoadRestrictionObject table are loaded; when the Exclude mode is used, all objects are loaded except those listed in the PipelineLoadRestrictionObject table.
PipelineLoadRestrictionObject¶
The `PipelineLoadRestrictionObject` table includes the actual list of objects related or belonging to each `LoadRestriction` rule set.
Column | Description |
---|---|
Id | Id of the load restriction object |
IdPipelineLoadRestriction | Id of the LoadRestriction rule set configured in the platform |
ObjectName | A string with the name of the object that is part of the LoadRestriction rule set |
The format for the `ObjectName` is either the name of the [table] or a concatenation of [database_name].[schema_name].[table_name], e.g. [TestDB0] or [TestDB4].[dbo].[TestTable0].
PipelineLoadRestriction¶
The `PipelineLoadRestriction` table is a relationship table that relates a `LoadRestriction` rule set and a `LoadRestrictionMode`.
Column | Description |
---|---|
Id | Id of the load restriction rule configured in the platform. This id is the same pipeline id of the pipeline used to load the data. |
IdLoadRestrictionMode | Id of the LoadRestrictionMode |
The information contained in the above three tables will be populated at new data source configuration time. These tables will then be consulted during the execution of the metadata extraction pipelines. See the documentation on data ingestion for more information on the usage of these tables.
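Putting the three tables together, a hedged sketch of configuring an inclusion list for a hypothetical extraction pipeline:

```sql
DECLARE @PipelineId INT = 42;   -- hypothetical: same Id as the extraction pipeline

-- Rule set in Include mode (1 = Include): load only the listed objects.
INSERT INTO PipelineLoadRestriction (Id, IdLoadRestrictionMode)
VALUES (@PipelineId, 1);

INSERT INTO PipelineLoadRestrictionObject (IdPipelineLoadRestriction, ObjectName)
VALUES (@PipelineId, '[TestDB4].[dbo].[TestTable0]'),
       (@PipelineId, '[TestDB4].[dbo].[TestTable1]');
```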
DataIntakeProcess¶
The `DataIntakeProcess` table (available from Sidra version 2022.R1) relates the connector plugin information with the Provider Id and details about the Data Intake Process configuration.
See the documentation on data ingestion for more information on the usage of these tables.
Column | Description |
---|---|
Id | Id of the Data Intake Process |
ItemId | GUID for the Data Intake Process |
Name | Friendly name for the Data Intake Process |
Description | Short description for the Data Intake Process |
LastExecution | Last execution date for the plugin used to configure the Data Intake Process |
IdProvider | Id of the Provider associated to this Data Intake Process |
IdPluginVersion | Id of the Plugin Version used to configure this Data Intake Process |
CreationDate | Date when the Data Intake Process was first created |
LastUpdated | Date when the Data Intake Process object was last updated |