How to add a new entity¶
This tutorial explains how to to add a new entity in Sidra, this is required for example when a new type of file has to be loaded from a FTP server or new entities are going to be pulled from an API. When adding a new entity, there are two possible scenarios, the entity belongs to an existing provider or a new provider must be configured. If the provider is not created yet, this tutorial How to add a new provider can be used to create it. Entities are stored in the Entity table in the metadata database.
There are two ways to create a new entity:
- Create a SQL script to insert the new entity in the database.
- Use the Sidra API endpoint to add new entities.
The method used will depend, for example, on having access to the source code of the solution -where the SQL scripts must be placed to be executed- or having to integrate with a third-party tool. In the latter case, the Sidra API is the approach recommended.
This is the information about the entities that must be included when a new entity is stored in the metadata database:
|Id||[Required] Entity identifier|
|IdProvider||[Required] Identifier of the provider that generates this type of data|
|Name||[Required] The name of the entity|
|TableName||[Required] The name of the table created in Databricks to store this entity|
|RegularExpression||[Required] Used to identify what is a file's entity. For example, if several providers leave files in the same route of an SFTP server, taking a look at their naming convention the system can identify the entity|
|StartValidDate||[Required] Date from which this entity applies to the file. It could happen that, at any time, a file is added one extra column, or it changes it schema. In order to preserve all previous data that has already been ingested in the system, there could be registered two entities that are related to the same type of file, but generated in two different moments|
|EndValidDate||[Optional] Date until this entity applies to the file|
|Serde||[Optional] If it is necessary to use a custom serializer in Hive to ingest these type of files, it must be specified here|
|SerdeProperties||[Optional] Properties of the specified serializer|
|Encoding||[Optional] Specifies the encoding used for the files related to this entity|
|HeaderLines||[Required] Specifies the number of header lines of the files related to this entity|
|FieldDelimiter||[Optional] The field delimiter used by the files. Can be specified as \uXXXX|
|IsMerged||[Required] If sets to true, only new data will be inserted in the partition|
|LastUpdated||[Required] Date in which the file metadata has been updated and needs to be deployed to have the latest changes synchronized with the rest of the system|
|Format||[Optional] The extension of the files|
|NullText||[Optional] If there is a special string used as a null value, it can be specified here|
|ReCreateTableOnDeployment||[Required] If it is needed to create or recreate the table in the data lake. In case the table exists, it will be overwritten. Once the table has been created, this value is automatically updated to 0|
|RowDelimiter||[Optional] The row delimiter used by the files. Can be specified as \uXXXX|
|FilesPerDrop||[Required] Sometimes one type of file can be generated from different sources. In this case, this parameter indicates the number of files of the entity needed to consider that the data for a day is complete|
|SourcePath||[Optional] Indicates the origin of the generated files. For example, in case of an SFTP server, this tells the path of the file inside it|
|Description||[Optional] Description of this entity|
|AdditionalProperties||[Optional] Json content with additional properties for extensibility|
|IdTableFormat||[Optional] Id of the format of the table where this entity is stored (ORC, Parquet and Delta are currently supported)|
|GenerateDeltaTable||[Required] When true, an additional table is created where the change produced by each new asset are traced|
|ParentSecurityPath||[Optional] The security path of the parent following the metadata model hierarchy|
|CreationDate||[Required] Date of creation of the entity|
RegularExpression. Sometimes must be agreed with other actors. That means, for instance, if the system is going to load files from a FTP and a third party is going to leave the files in a specific FTP server, DEV team has to discuss with the other team what is going to be the naming convention for the so that the importing process can recognize the file via RegularExpression. This scenario applies for FTP/landing zone file drops. When pulling from data sources such as SQL databases, the DEV team will be in charge of defining this and no interaction with other teams is required.
Serde. Here it is specified the source format of the file. The value is often "csv" but it could be one of the allowed values for the
spark.read.format()function, which, as of today are: json, parquet, jdbc, orc, libsvm, csv, text.
SerdeProperties. Depending on which source format the system ingests, the SerdeProperties may vary. To provide more context about this column, the content of it will be passed to the options function in Databricks. For instance, using the following configuration: Serde = 'csv', SerdeProperties = '"separatorChar" = "\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002"', the outcome in Databricks Spark will be something like:
read.format('csv').options(separatorChar"="\u0001","quoteChar"= "\u0003","escapeChar"= "\u0002").
Extension. Frequently used when the file is in csv format but compressed in zip or gzip. If the file is something like "myfile.csv.gz" then, Extension will be "gz".
Add entity using a SQL script¶
Add a SQL script following the naming conventions to the
Scripts\CoreContext folder or to the place configured in the DatabaseBuilder from where it retrieves the scripts. A sample of the script to add a new entity can be found below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
Add entity using Sidra API¶
Sidra API requires requests to be authenticated, the section How to use Sidra API explains how to create an authenticated requests. For the rest of the document, it is going to be supposed that Sidra API is deployed in the following URL:
This is the sequence of requests requiered to create a new entity, some of the request are used to gather information about the entity, if that information is already available, those requests will not be necessary.
Step 1. Get the identifier of the provider¶
Before creating an entity, it is required to know the Id provider to which the entity will be associated. If the identifier is already known, this step can be skipped. If not, this request can be made to retrieve all the providers.
It will return an object with the list of providers, including their Ids.
1 2 3 4 5 6 7 8
From the previous response, the Id of the provider can be retrived and used in the next step to populate the field
Step 2. Create a new entity¶
By using the Sidra API some of the fields of the entity are automatically populated -like
CreationDate-. Once the rest of the information of the entity is gathered, it can be created using this request.
And adding the following content as part of the body.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
The response will return the Id of the Entity created, which can be used, for example, for adding a new attribute.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27