How the metadata of ingested content is structured in a hierarchical model

Every time content is ingested into the Sidra platform, some metadata about that content is stored, both for tracking purposes and to have the information ready for exporting the content.

The metadata stored for a piece of content is organized as a set of related elements called the 'metadata model'. The relationships between the elements of the metadata model essentially form a hierarchy. These are the elements and the relationships between them:

  • Asset represents the content imported into Sidra. Each asset contains metadata about the content, for example an identifier that is unique across the platform, the date of the asset, etc. Every asset is associated with an entity.
  • Entity defines the common properties of a group of assets in terms of the structure of the content. Every piece of content ingested generates its own asset, but all content that shares the same structure is associated with the same entity. Each entity contains metadata about encoding, separators, handling of null values, etc. Each entity contains a set of attributes, and every entity belongs to a provider.
  • Attribute defines the metadata of a column of the content, for example the type of the values in the column, its order with respect to the other columns of the same entity, its name, etc. It also contains information about how the content will be ingested into the Data Lake. Each attribute belongs to an entity.
  • AttributeFormat defines a transformation applied to the values of a column when they are ingested into the Data Lake. For example, it can be used when the content represents a boolean value as "1" or "0" and that value must be stored in the Data Lake as "TRUE" or "FALSE". Each attribute format belongs to an attribute.
  • Provider is a logical set of entities, usually representing content with the same origin. Each provider belongs to a Data Storage Unit.
  • Data Storage Unit (DSU) is a logical and physical isolation of data inside the Data Lake. The set of all DSUs makes up the Data Lake.
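
The relationships described above can be sketched as plain Python data classes. This is only an illustration of the containment hierarchy, not Sidra's actual schema or API: every class field, and the sample names such as `customers`, `crm` and `core`, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttributeFormat:
    """Maps one raw source value to the value stored in the Data Lake."""
    source_value: str
    lake_value: str


@dataclass
class Attribute:
    """Describes one column of an entity (hypothetical fields)."""
    name: str
    order: int
    value_type: str
    formats: List[AttributeFormat] = field(default_factory=list)

    def to_lake_value(self, raw: str) -> str:
        # Apply the first matching format; values without a format pass through.
        for fmt in self.formats:
            if fmt.source_value == raw:
                return fmt.lake_value
        return raw


@dataclass
class Entity:
    """Common structure shared by a group of assets."""
    name: str
    encoding: str
    separator: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Provider:
    """Logical set of entities, usually with the same origin."""
    name: str
    entities: List[Entity] = field(default_factory=list)


@dataclass
class DataStorageUnit:
    """Logical and physical isolation of data inside the Data Lake."""
    name: str
    providers: List[Provider] = field(default_factory=list)


@dataclass
class Asset:
    """One ingested piece of content, always tied to an entity."""
    asset_id: str
    asset_date: str
    entity: Entity


# Hypothetical sample data: a boolean column stored as "1"/"0" in the source.
is_active = Attribute(
    "is_active", 2, "BOOLEAN",
    [AttributeFormat("1", "TRUE"), AttributeFormat("0", "FALSE")],
)
customers = Entity("customers", "utf-8", ",", [Attribute("customer_id", 1, "INT"), is_active])
crm = Provider("crm", [customers])
dsu = DataStorageUnit("core", [crm])
asset = Asset("a-001", "2024-01-31", customers)

print(is_active.to_lake_value("1"))  # TRUE
print(asset.entity.name)             # customers
```

In this sketch the asset holds a reference to its entity rather than being contained by it, which mirrors the fact that many assets share one entity.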

The metadata model hierarchy therefore looks as follows:

Data Lake 
    |
Data Storage Unit 
    |
Provider 
    |
Entity <----- Asset
    |
Attribute
    |
AttributeFormat
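
Read from the bottom up, the diagram means that any element can be resolved to the Data Lake by following its parent links. A minimal sketch of that walk, assuming a hypothetical dictionary of parent links with made-up element names (nothing here comes from Sidra itself):

```python
from typing import Dict, List

# Hypothetical parent links: each element name maps to the name of its parent.
PARENT: Dict[str, str] = {
    "core_dsu": "Data Lake",                    # DSU -> Data Lake
    "crm": "core_dsu",                          # Provider -> DSU
    "customers": "crm",                         # Entity -> Provider
    "customers_2024-01-31.csv": "customers",    # Asset -> Entity
    "is_active": "customers",                   # Attribute -> Entity
    "is_active_bool": "is_active",              # AttributeFormat -> Attribute
}


def path_to_root(element: str) -> List[str]:
    """Follow parent links from an element up to the top of the hierarchy."""
    path = [element]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


print(path_to_root("customers_2024-01-31.csv"))
# ['customers_2024-01-31.csv', 'customers', 'crm', 'core_dsu', 'Data Lake']
```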

When a new type of content needs to be added to the Data Lake, it must be associated with an entity that is integrated in the metadata model. Any elements of the model that are missing must be created as well:

  • The Data Lake is the sum of all the DSUs, so it never needs to be created.
  • The DSU is tied to the infrastructure and is created during the deployment of Sidra. A new DSU is only necessary when the new type of content must be physically isolated from the rest of the data in the Data Lake.
  • A new entity is always required.
  • A new provider may be necessary if the new entity does not belong to any of the existing providers.
  • Several attributes must also be created, one for each field (or column) of the new type of content.
  • AttributeFormats are optional; they are needed only when the values of the new attributes must be transformed.
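
The checklist above can be expressed as a small planning function. This is a sketch under stated assumptions: `plan_new_content`, its parameters and the step strings are hypothetical and only encode the rules listed above; they are not part of any Sidra API.

```python
from typing import List, Optional, Set


def plan_new_content(
    entity_name: str,
    provider_name: str,
    columns: List[str],
    existing_providers: Set[str],
    columns_needing_format: Optional[Set[str]] = None,
    needs_physical_isolation: bool = False,
) -> List[str]:
    """Return the metadata elements to create, from the top of the hierarchy down."""
    needing_format = columns_needing_format or set()
    steps: List[str] = []
    # The Data Lake itself is never created; a DSU only when isolation is required.
    if needs_physical_isolation:
        steps.append("deploy new DSU")
    # A provider is created only if the entity fits none of the existing ones.
    if provider_name not in existing_providers:
        steps.append(f"create Provider '{provider_name}'")
    # A new entity is always required.
    steps.append(f"create Entity '{entity_name}'")
    # One attribute per column, plus an optional format where values are transformed.
    for column in columns:
        steps.append(f"create Attribute '{column}'")
        if column in needing_format:
            steps.append(f"create AttributeFormat for '{column}'")
    return steps


for step in plan_new_content(
    entity_name="customers",
    provider_name="crm",
    columns=["customer_id", "is_active"],
    existing_providers={"crm"},
    columns_needing_format={"is_active"},
):
    print(step)
# create Entity 'customers'
# create Attribute 'customer_id'
# create Attribute 'is_active'
# create AttributeFormat for 'is_active'
```

Because the provider `crm` already exists in this example and no isolation is requested, only the entity, its attributes and one attribute format appear in the plan.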