How content metadata is structured in a hierarchical model
Every time content is ingested into Sidra's platform, some of its metadata is stored, both for tracking purposes and so that the information needed to export the content is readily available.
The metadata stored for a piece of content is organized as a set of interrelated elements, called the 'metadata model' in Sidra. The relationships between the elements of the metadata model form a hierarchy. The elements and the relationships between them are:
- Asset represents the content imported into Sidra. Each Asset contains metadata about a specific instance of the content, for example an identifier that is unique across the platform, the date of the Asset, etc. Every Asset is associated with an Entity.
- Entity defines the common properties of a group of Assets in terms of the structure of the content. Every piece of content ingested generates its own Asset, but all content that shares the same structure is associated with the same Entity. Each Entity contains metadata about encoding, separators, null value handling, etc. Each Entity contains a set of Attributes, and every Entity belongs to a Provider.
- Attribute defines the metadata of a column of the content, for example the type of the values in the column, its order relative to the other columns of the same Entity, the name, etc. It also contains information about how the content will be ingested into the Data Lake. Each Attribute belongs to an Entity.
- Provider is a logical set of Entities, usually representing content with the same origin. Each Provider belongs to a Data Storage Unit.
- Data Storage Unit (DSU) is a logical and physical isolation of data inside the Data Lake. The set of all the DSUs makes up the Data Lake.
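The containment relationships described in the list above can be sketched with a few Python dataclasses. This is only an illustration of the hierarchy: the class and field names (`value_type`, `asset_date`, `count_assets`, and so on) are assumptions made for the example and do not reflect Sidra's actual tables or API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Attribute:
    # Metadata of one column of the content (hypothetical fields).
    name: str
    value_type: str
    order: int


@dataclass
class Asset:
    # One ingested instance of the content.
    id: int
    asset_date: date


@dataclass
class Entity:
    # Common structure shared by a group of Assets.
    id: int
    name: str
    attributes: List[Attribute] = field(default_factory=list)
    assets: List[Asset] = field(default_factory=list)


@dataclass
class Provider:
    # Logical set of Entities, usually with the same origin.
    id: int
    name: str
    entities: List[Entity] = field(default_factory=list)


@dataclass
class DataStorageUnit:
    # Logical and physical isolation of data inside the Data Lake.
    id: int
    providers: List[Provider] = field(default_factory=list)


def count_assets(dsu: DataStorageUnit) -> int:
    # Walk the hierarchy DSU -> Provider -> Entity -> Asset.
    return sum(len(e.assets) for p in dsu.providers for e in p.entities)
```

In this sketch the Data Lake itself would simply be the collection of all `DataStorageUnit` objects, which matches the statement that the Data Lake is the set of all the DSUs.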
Below is a schema depiction of the metadata model hierarchy:
When a new type of content (a new data source) needs to be added to the Data Storage Unit, it must be associated with an Entity that is integrated in the metadata model. Any elements of the model that are missing must be created too:
- The Data Lake is the sum of all the DSUs, so it is never required to create one.
- The DSU is tied to the infrastructure and is created during the deployment of Sidra. Unless the new type of content needs to be physically isolated from the rest of the data in the Data Lake, it will not be necessary to create a new DSU.
- It will definitely be necessary to create a new Entity.
- It could be necessary to create a new Provider if the new Entity does not logically belong to any of the existing Providers. A new Entity can therefore be associated with an existing Provider, or require a new Provider to be created first.
- It will also be necessary to create several Attributes, one for each field (or column) of the new type of content.
- AttributeFormats are optional; whether they are needed depends on whether the new Attributes have to be transformed.
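The decision process above can be summarized as a small planning function. This is a minimal sketch, assuming the target DSU already exists and that no AttributeFormats are needed; the function name `plan_new_elements` and the string labels it returns are hypothetical and are not part of Sidra.

```python
from typing import List, Set


def plan_new_elements(
    provider_name: str,
    existing_providers: Set[str],
    column_names: List[str],
) -> List[str]:
    """Return, in creation order, the metadata elements that must be created."""
    steps: List[str] = []
    # A Provider is only created when the Entity fits none of the existing ones.
    if provider_name not in existing_providers:
        steps.append(f"Provider:{provider_name}")
    # A new type of content always requires a new Entity.
    steps.append("Entity")
    # One Attribute per field (column) of the new content.
    steps.extend(f"Attribute:{name}" for name in column_names)
    return steps
```

For example, `plan_new_elements("Sales", {"HR"}, ["id", "amount"])` lists a new Provider first, then the Entity, then one Attribute per column, whereas passing an existing Provider name skips the Provider step.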
Common columns in tables
There are some common columns that are present in several tables of the Data Ingestion schema:
- SecurityPath: It is a path of identifiers separated by / used for authorization. The order of the identifiers in the path follows the metadata model hierarchy mentioned above. For example, the SecurityPath 1/10/100 identifies an Entity with Id 100 which belongs to a Provider with Id 10 that is contained in a DSU with Id 1.
- ParentSecurityPath: The SecurityPath of the parent element following the metadata model hierarchy. For example, the ParentSecurityPath of the Entity of the previous example is 1/10.
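The relationship between the two columns can be illustrated with a short Python sketch. The helper names below are hypothetical, and returning `None` for the parent of a top-level path is an assumption of this example, since the text above does not say what value Sidra stores in that case.

```python
from typing import Dict, Optional


def build_security_path(*ids: int) -> str:
    # Join identifiers in hierarchy order: DSU, Provider, Entity.
    return "/".join(str(i) for i in ids)


def parent_security_path(security_path: str) -> Optional[str]:
    # Drop the last identifier; a single-level path has no parent here (assumption).
    head, sep, _ = security_path.rpartition("/")
    return head if sep else None


def parse_security_path(security_path: str) -> Dict[str, int]:
    # Map each identifier to its level in the hierarchy (key names are illustrative).
    levels = ("dsu_id", "provider_id", "entity_id")
    ids = [int(part) for part in security_path.split("/")]
    return dict(zip(levels, ids))
```

With the example above, `build_security_path(1, 10, 100)` gives `1/10/100`, and `parent_security_path("1/10/100")` gives `1/10`, the path of the Provider that contains the Entity.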