Metadata in Sidra
Every time content is ingested into the Sidra platform, metadata about that content is stored, both for tracking purposes and so that the information needed to export the content is readily available.
The metadata stored for a piece of content is organized as a set of interrelated elements, which Sidra calls the 'metadata model'. The relationships between the elements of the metadata model form a hierarchy. These are the elements and the relationships between them:
- Asset represents the content imported into Sidra. Each Asset contains metadata about a specific instance of the content, for example an identifier that is unique across the platform, the date of the Asset, etc. Every Asset is associated with an Entity.
- Entity defines the common properties of a group of Assets in terms of the structure of the content. Every piece of content ingested generates its own Asset, but all content that shares the same structure is associated with the same Entity. Each Entity contains metadata about encoding, separators, null value handling, etc. Each Entity contains a set of Attributes, and every Entity belongs to a Provider.
- Attribute defines the metadata of a column of the content, for example the type of the values in the column, its order relative to the other columns of the same Entity, its name, etc. It also contains information about how the content will be ingested into the Data Lake. Each Attribute belongs to an Entity.
- Provider is a logical set of Entities, usually representing content with the same origin. Each Provider belongs to a Data Storage Unit.
- Data Storage Unit (DSU) is a logical and physical isolation of data inside the Data Lake. The set of all the DSUs makes up the Data Lake.
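The hierarchy described in the list above can be sketched with a few nested data classes. This is only an illustrative model and not Sidra's actual schema or API: the class names mirror the element names, but every field name (`asset_id`, `data_type`, `order`, `separator`, and so on), the sample values, and the `describe` helper are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Attribute:
    name: str
    data_type: str  # type of the values in the column (hypothetical field)
    order: int      # position relative to other columns of the same Entity


@dataclass
class Asset:
    asset_id: int     # identifier unique across the platform (hypothetical field)
    asset_date: date  # date of the Asset


@dataclass
class Entity:
    name: str
    encoding: str = "utf-8"  # hypothetical default
    separator: str = ","     # hypothetical default
    attributes: List[Attribute] = field(default_factory=list)
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Provider:
    name: str
    entities: List[Entity] = field(default_factory=list)


@dataclass
class DataStorageUnit:
    name: str
    providers: List[Provider] = field(default_factory=list)


def describe(dsu: DataStorageUnit) -> List[str]:
    """Return the hierarchy under one DSU as indented text lines."""
    lines = [f"DSU: {dsu.name}"]
    for provider in dsu.providers:
        lines.append(f"  Provider: {provider.name}")
        for entity in provider.entities:
            lines.append(f"    Entity: {entity.name}")
            # Attributes are listed in column order.
            for attr in sorted(entity.attributes, key=lambda a: a.order):
                lines.append(f"      Attribute: {attr.name} ({attr.data_type})")
            for asset in entity.assets:
                lines.append(
                    f"      Asset: {asset.asset_id} ({asset.asset_date.isoformat()})"
                )
    return lines


# Sample data, invented for illustration only.
sample = DataStorageUnit(
    "Core",
    [Provider("Sales", [Entity(
        "Orders",
        attributes=[Attribute("Amount", "DECIMAL", 2), Attribute("OrderId", "INT", 1)],
        assets=[Asset(1, date(2024, 1, 15))],
    )])],
)
print("\n".join(describe(sample)))
```

In this sketch the Data Lake itself is not modelled as a class, since it is simply the collection of all DSUs.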
Below is a schema depiction of the metadata model hierarchy:
When a new type of content (a new data source) needs to be added to a Data Storage Unit, it must be associated with an Entity that is integrated into the metadata model. Any elements of the model that are missing must be created too:
- The Data Lake is the sum of all the DSUs, so it is never required to create one.
- The DSU is tied to the infrastructure and is created during the deployment of Sidra. Unless the new type of content needs to be physically isolated from the rest of the data in the Data Lake, it will not be necessary to create a new DSU.
- It will always be necessary to create a new Entity.
- It may be necessary to create a new Provider if the new Entity does not logically belong to any of the existing Providers. A new Entity can therefore either be associated with an existing Provider or require a new Provider to be created first.
- It will also be necessary to create several Attributes, one for each field (or column) of the new type of content.
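The checklist above can be summarized as a small routine: reuse the existing DSU, reuse or create the Provider, always create the Entity, and create one Attribute per column. The following is a minimal in-memory sketch of that decision flow, not the way Sidra itself registers metadata. The function `register_content_type`, the class fields, and the choice to raise an error on a duplicate Entity name are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    name: str
    # Each Attribute is simplified here to an (order, column name) pair.
    attributes: List[Tuple[int, str]] = field(default_factory=list)


@dataclass
class Provider:
    name: str
    entities: Dict[str, Entity] = field(default_factory=dict)


@dataclass
class DataStorageUnit:
    name: str
    providers: Dict[str, Provider] = field(default_factory=dict)


def register_content_type(
    dsu: DataStorageUnit,
    provider_name: str,
    entity_name: str,
    columns: List[str],
) -> Entity:
    """Hypothetical helper: add a new type of content to an existing DSU."""
    # The DSU already exists (it is created at deployment), so it is reused.
    # Reuse the Provider if the Entity logically belongs to one; else create it.
    provider = dsu.providers.get(provider_name)
    if provider is None:
        provider = Provider(provider_name)
        dsu.providers[provider_name] = provider

    # A new Entity is always required. Rejecting a duplicate name is an
    # assumption of this sketch, not documented Sidra behaviour.
    if entity_name in provider.entities:
        raise ValueError(f"Entity {entity_name!r} already exists")

    # One Attribute per field (column), numbered in column order from 1.
    entity = Entity(entity_name, list(enumerate(columns, start=1)))
    provider.entities[entity_name] = entity
    return entity


# Example with invented names: the second call reuses the "Sales" Provider.
dsu = DataStorageUnit("Core")
orders = register_content_type(dsu, "Sales", "Orders", ["OrderId", "Amount"])
customers = register_content_type(dsu, "Sales", "Customers", ["CustomerId", "Name"])
```

After both calls the DSU holds a single Provider with two Entities, which mirrors the case where a new Entity is associated with an existing Provider.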