How the metadata of ingested content is structured in a hierarchical model

Every time content is ingested into the Sidra platform, some metadata about it is stored, both for tracking purposes and so that the information needed to export the content is readily available.

The metadata stored for a piece of content is organized as a set of interrelated elements. In Sidra this is called the 'metadata model'. The relationships between the elements of the metadata model form, in essence, a hierarchy. These are the elements and the relationships between them:

  • Asset represents a piece of content imported into Sidra. Each Asset contains metadata about a specific instance of the content, for example an identifier that is unique across the platform, the date of the Asset, etc. Every Asset is associated with an Entity.
  • Entity defines the properties common to a group of Assets in terms of the structure of the content. Every piece of content ingested generates its own Asset, but all content that shares the same structure is associated with the same Entity. Each Entity contains metadata about encoding, separators, handling of null values, etc. Each Entity contains a set of Attributes, and every Entity belongs to a Provider.
  • Attribute defines the metadata of a column of the content, for example the type of the values in the column, its order with respect to the other columns of the same Entity, its name, etc. It also contains information about how the content will be ingested into the Data Lake. Each Attribute belongs to an Entity.
  • AttributeFormat defines a transformation applied to the values of a column when they are ingested into the Data Lake. For example, it can be used when the content represents a boolean value as "1" or "0" and the value must be stored in the Data Lake as "TRUE" or "FALSE". Each AttributeFormat belongs to an Attribute.
  • Provider is a logical set of Entities, usually representing content with the same origin. Each Provider belongs to a Data Storage Unit.
  • Data Storage Unit (DSU) is a logical and physical isolation of data inside the Data Lake. The set of all the DSUs makes up the Data Lake.

The following diagram depicts the metadata model hierarchy:

Data Lake 
    |
Data Storage Unit 
    |
Provider 
    |
Entity <----- Asset
    |
Attribute
    |
AttributeFormat
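
The hierarchy above can be sketched as plain Python data classes. This is only an illustrative model under stated assumptions, not Sidra's actual schema or API: every class field (`order`, `value_type`, `separator`, `source_value`, `lake_value`, etc.), the `transform` helper, and the sample names ("Customers", "CRM", "Default") are hypothetical and chosen to mirror the descriptions in the text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class AttributeFormat:
    # Hypothetical fields: a source value and the value stored in the Data Lake.
    source_value: str
    lake_value: str


@dataclass
class Attribute:
    name: str
    order: int        # position with respect to the other columns of the Entity
    value_type: str
    formats: List[AttributeFormat] = field(default_factory=list)

    def transform(self, raw: str) -> str:
        """Return the lake value of the first matching format, else raw unchanged."""
        for fmt in self.formats:
            if fmt.source_value == raw:
                return fmt.lake_value
        return raw


@dataclass
class Entity:
    name: str
    encoding: str
    separator: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Asset:
    asset_id: int
    asset_date: date
    entity: Entity    # every Asset is associated with exactly one Entity


@dataclass
class Provider:
    name: str
    entities: List[Entity] = field(default_factory=list)


@dataclass
class DataStorageUnit:
    name: str
    providers: List[Provider] = field(default_factory=list)


# Build one branch of the hierarchy, top-down.
is_active = Attribute(
    "IsActive", 1, "BOOLEAN",
    [AttributeFormat("1", "TRUE"), AttributeFormat("0", "FALSE")],
)
customers = Entity("Customers", "utf-8", ",", [Attribute("Id", 0, "INT"), is_active])
crm = Provider("CRM", [customers])
data_lake = [DataStorageUnit("Default", [crm])]  # the Data Lake is the set of all DSUs

# Two ingested files with the same structure yield two Assets sharing one Entity.
asset_a = Asset(1, date(2024, 1, 15), customers)
asset_b = Asset(2, date(2024, 1, 16), customers)

print(is_active.transform("1"))  # TRUE
```

Note that the Asset points at its Entity rather than the other way round, matching the `Entity <----- Asset` arrow in the diagram.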

When a new type of content (a new data source) needs to be added to the Data Storage Unit, it must be associated with an Entity that is integrated in the metadata model. Any elements of the model that are missing must be created too:

  • The Data Lake is the sum of all the DSUs, so it never needs to be created.
  • The DSU is tied to the infrastructure and is created during the deployment of Sidra. Unless the new type of content needs to be physically isolated from the rest of the data in the Data Lake, it will not be necessary to create a new DSU.
  • It will always be necessary to create a new Entity.
  • It may be necessary to create a new Provider if the new Entity does not logically belong to any of the existing Providers. A new Entity can therefore be associated with an existing Provider, or require a new Provider to be created first.
  • It will also be necessary to create several Attributes, one for each field (or column) of the new type of content.
  • AttributeFormats are optional; whether they are needed depends on whether the values of the new Attributes have to be transformed.
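
The rules above can be summarized as a small planning function. This is a minimal sketch, not part of Sidra: the function `elements_to_create`, its parameters, and the `"Kind:name"` labels it returns are all hypothetical, and the existing model is reduced to two sets of names for illustration.

```python
from typing import List, Optional, Set


def elements_to_create(
    existing_dsus: Set[str],
    existing_providers: Set[str],
    provider: str,
    entity: str,
    columns: List[str],
    formatted_columns: Optional[Set[str]] = None,
    isolated_dsu: Optional[str] = None,
) -> List[str]:
    """Return the metadata elements to create, parents before children."""
    formatted_columns = formatted_columns or set()
    plan: List[str] = []

    # The Data Lake itself is never created: it is the sum of all DSUs.
    # A new DSU is needed only when the content must be physically isolated.
    if isolated_dsu is not None and isolated_dsu not in existing_dsus:
        plan.append(f"DataStorageUnit:{isolated_dsu}")

    # A Provider is created only if the Entity fits none of the existing ones.
    if provider not in existing_providers:
        plan.append(f"Provider:{provider}")

    # A new Entity is always required for a new type of content.
    plan.append(f"Entity:{entity}")

    # One Attribute per column; an AttributeFormat only where values are transformed.
    for column in columns:
        plan.append(f"Attribute:{column}")
        if column in formatted_columns:
            plan.append(f"AttributeFormat:{column}")

    return plan


plan = elements_to_create(
    existing_dsus={"Default"},
    existing_providers={"CRM"},
    provider="CRM",
    entity="Customers",
    columns=["Id", "IsActive"],
    formatted_columns={"IsActive"},
)
print(plan)
# ['Entity:Customers', 'Attribute:Id', 'Attribute:IsActive', 'AttributeFormat:IsActive']
```

In this example the Provider already exists and no isolation is requested, so only the Entity, its Attributes, and one AttributeFormat appear in the plan.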