How to add a new Asset¶
Information
This tutorial applies to intakes of Landing Zone type.
The purpose of this page is to provide a guide to the steps required to ingest a new type of content into the Sidra platform. It is important to understand which metadata about the new content must be generated and how that metadata is structured. This document also covers the flow that an Asset follows in Sidra, from extraction, through copying to the landing zone, to triggering its final ingestion in the Data Storage Unit.
For more information about the metadata model in Sidra, please refer to the Data ingestion metadata documentation.
An Asset in Sidra is a representation of each instance of the data elements that get ingested into the platform. These data elements can range from an intermediate extract from a database into a CSV format, to a PDF file which is part of a larger collection.
It is important to note that certain types of data ingestion flows extract the data directly from the data source (e.g. a SQL database) into the raw format in the Data Storage Unit, without placing it in the landing zone. For certain other types of data flows, such as the generic batch file intake process, the landing zone file ingestion is used as the mechanism.
In brief, there are two sets of tasks that must be completed:
- Metadata configuration.
- Data movement orchestration.
1. Metadata configuration¶
The metadata is stored in the metadata database (the name of the database in Azure is Core, so it is also referred to as the Core database). Each element of the metadata model is stored in a table of that database. Configuring the metadata basically means inserting information into those tables.
The configuration will depend on the content to be ingested and the metadata information that already exists in the database.
For example, it could be necessary to add a new Provider, or one of the existing Providers could be reused.
The most common path to configure the metadata is the following (a sketch of this configuration follows the list):
- Add a new Provider if none of the existing ones can be used.
- Add a new Entity.
- Add a new Attribute for each of the columns of the new type of content.
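As an illustration only, the sketch below inserts this metadata directly into the Core database with pyodbc. The table and column names ([DataIngestion].[Provider], [DataIngestion].[Entity], [DataIngestion].[Attribute] and their fields), the connection string and the identifier values are assumptions for the example and should be checked against the actual schema described in the Data ingestion metadata documentation:

```python
import pyodbc

# Connection string is a placeholder; point it to the Core metadata database.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:core-mycompany-dev.database.windows.net,1433;"
    "Database=Core;Authentication=ActiveDirectoryInteractive;"
)
cursor = conn.cursor()

# 1. Add a new Provider (skip if an existing Provider can be reused).
#    Table and column names are assumed for this sketch.
cursor.execute(
    "INSERT INTO [DataIngestion].[Provider] (ProviderName, Description) VALUES (?, ?)",
    "MyProvider", "Provider for the new type of content",
)

# 2. Add a new Entity describing the new type of content.
#    The regex is the naming convention the landed files must match.
cursor.execute(
    "INSERT INTO [DataIngestion].[Entity] (IdProvider, Name, RegularExpression) VALUES (?, ?, ?)",
    1, "MyEntity", r"^example_\d{8}\.csv$",   # 1 = Id of the Provider above (illustrative)
)

# 3. Add an Attribute for each column of the new content.
cursor.execute(
    "INSERT INTO [DataIngestion].[Attribute] (IdEntity, Name, HiveType, SqlType) VALUES (?, ?, ?, ?)",
    1, "Id", "INT", "INT",                    # 1 = Id of the Entity above (illustrative)
)

conn.commit()
```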
2. Data movement orchestration¶
This section describes the complete flow of an Asset from the data source to the Data Lake when using the landing zone as the starting point. Many of the data movements in that journey, either to deposit the files in the landing zone or to ingest them from the landing zone, are already implemented and orchestrated by Sidra. Others need to be configured, but Sidra provides some accelerators to ease the task.
The steps to configure and orchestrate the data movements are:
- Review the Asset addition methods and select the most appropriate one depending on the requirements. The most common scenario is using Data Factory and internal pipelines.
- Finally, it is required to associate the Entity with the Data Factory pipelines (a sketch of this association follows this list). This is needed not only for the newly created pipeline, but also when using the already deployed FileIngestionDatabricks pipeline, which is used in all the Asset addition methods.
- The following link describes how to associate an Entity with a pipeline.
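As a minimal sketch only, assuming the association is stored in a [DataIngestion].[EntityPipeline] table (the table name, column names and identifier values are assumptions for this example; the linked tutorial describes the supported way to create the association):

```python
import pyodbc

# Hypothetical sketch: the association table and its columns are assumptions.
core_connection_string = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:core-mycompany-dev.database.windows.net,1433;"
    "Database=Core;Authentication=ActiveDirectoryInteractive;"
)
entity_id = 1      # Id of the Entity created in the metadata configuration step (illustrative)
pipeline_id = 10   # Id of the pipeline to associate, e.g. FileIngestionDatabricks (illustrative)

conn = pyodbc.connect(core_connection_string)
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO [DataIngestion].[EntityPipeline] (IdEntity, IdPipeline) VALUES (?, ?)",
    entity_id, pipeline_id,
)
conn.commit()
```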
Summary
An overview of this generic Asset flow is summarized below:
1. Extract data from the data source and convert it into a file.
2. Copy the file to the landing zone.
3. Launch the ingestion process from the landing zone.
4. File registration using the Sidra API.
5. File ingestion in the Data Storage Unit.
This section covers steps 1, 2 and 3. These steps are also described in the page about how an Asset flows into the platform via the landing zone.
The sections below describe different methods to cover steps 1 and 2 of the above process, in order to extract data from the source system and ingest an Asset into the Data Storage Unit in Sidra.
Method 1. Use internal pipelines with triggers to copy the data to the landing zone¶
This method uses internal pipelines created in Azure Data Factory to move the data from the data source to the landing zone.
This process covers steps 1 and 2 of the generic flow summarized above.
When we use the term "Internal" for pipelines, this means that the pipelines are created in the Data Factories of the platform.
The internal pipelines must be created specifically for the Provider, but there is a set of accelerators to help in the implementation:
- Linked services are created to allow the pipelines access to the Provider. A Linked Service is a representation of the connection to the data source.
Step-by-step
Connecting to new data sources through linked services
For more information, check the specific tutorial for adding new linked services to the platform.
- Data Factory pipelines can be defined in a JSON format. Sidra stores the information of the pipelines as a set of templates and instances that are composed to build that JSON. All that information is stored in the Sidra Core metadata database. Finally, the Sidra Data Factory Manager composes the JSON and uses it to programmatically create the pipeline in Data Factory.
- Data Factory includes several native activities for copying and transforming data. If they are not enough, Python scripts with the desired behavior can be created and orchestrated through ADF Databricks activities.
In addition to the pipelines, triggers must be created to launch the execution of the pipelines. The triggers can be configured on a time schedule or to respond to events in an Azure Blob Storage account.
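As a minimal sketch, assuming the trigger is created directly with the Azure SDK for Python rather than through Sidra's Data Factory Manager, a schedule trigger for a hypothetical CopyToLandingZone pipeline could look like this (subscription, resource group, factory and pipeline names are placeholders):

```python
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

# Placeholder values: adapt to the actual subscription, resource group and Data Factory.
subscription_id = "<subscription-id>"
resource_group = "Sidra.Core"
factory_name = "core-mycompany-dev-df"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Run the (hypothetical) CopyToLandingZone pipeline once a day.
trigger = ScheduleTrigger(
    description="Daily copy of the source extract to the landing zone",
    recurrence=ScheduleTriggerRecurrence(
        frequency="Day",
        interval=1,
        start_time=datetime.utcnow() + timedelta(minutes=15),
        time_zone="UTC",
    ),
    pipelines=[
        TriggerPipelineReference(
            pipeline_reference=PipelineReference(reference_name="CopyToLandingZone"),
            parameters={},
        )
    ],
)

adf_client.triggers.create_or_update(
    resource_group, factory_name, "DailyCopyToLandingZone", TriggerResource(properties=trigger)
)
# Triggers are created in a stopped state; start the trigger so it begins firing.
adf_client.triggers.begin_start(resource_group, factory_name, "DailyCopyToLandingZone").result()
```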
Method 2. Asset is pushed to the landing zone¶
In this scenario, steps 1 and 2 of the generic flow are implemented outside the platform, and inside the platform the process starts at step 3. That means that, with this method, the flow starts when the file has already been copied to the landing zone.
How the file is copied there is outside of the scope of the platform. For example:
- A file could be manually copied to the landing zone following the next convention (a sketch is shown after this list):
  /<name_provider>/<name_entity>/<year>/<month>/<day>/<file>.<format>
- A third party component developed by the customer could be in charge of extracting the data from the source data system and copying the extracted file into the landing zone.
- As described below, the Sidra API could also be used to push the extracted file to the landing zone.
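For illustration, the sketch below builds the landing zone path following the convention above and uploads a local file with the Azure Storage SDK for Python. The storage account URL, container name and credential are placeholders, and the actual landing zone location for the Data Storage Unit should be confirmed in the Sidra deployment:

```python
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobClient

# Placeholder values: adapt to the actual landing zone storage account and container.
account_url = "https://mycompanydevlanding.blob.core.windows.net"   # assumed account
container_name = "landing"                                          # assumed container
provider_name = "myprovider"
entity_name = "myentity"
local_file = "example_20190101.csv"

# Build the blob path following the /<provider>/<entity>/<year>/<month>/<day>/<file> convention.
now = datetime.now(timezone.utc)
blob_name = f"{provider_name}/{entity_name}/{now:%Y}/{now:%m}/{now:%d}/{local_file}"

blob = BlobClient(
    account_url=account_url,
    container_name=container_name,
    blob_name=blob_name,
    credential=DefaultAzureCredential(),
)
with open(local_file, "rb") as data:
    blob.upload_blob(data, overwrite=True)
```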
Once the file is stored in the landing zone, a Data Factory trigger is launched. This Data Factory trigger then executes the IngestFromLanding pipeline (covering step 3 above).
This method is more flexible than the previous one. Data Factory is a versatile tool but there could be scenarios in which the retrieval of data from the source system is not possible using Data Factory.
For this scenario, the only thing to take into account is to agree on the naming convention for the file. As explained before, files are registered in the system based on a naming convention that must match a regular expression (regex), which is stored in the Entity metadata. If the regex does not match, the IngestFromLanding pipeline will fail upon execution. Once the regex has been agreed, the system is ready to load files into the Data Storage Unit.
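As a quick sanity check before dropping files in the landing zone, the file name can be validated locally against the regex agreed for the Entity. The pattern below is only an example; the authoritative value is whatever is stored in the Entity metadata:

```python
import re

# Example pattern only: the authoritative regex is stored in the Entity metadata.
entity_regex = r"^example_\d{8}\.csv$"

file_name = "example_20190101.csv"

if re.match(entity_regex, file_name):
    print(f"{file_name} matches the Entity regex and can be registered.")
else:
    print(f"{file_name} does not match; the IngestFromLanding pipeline would fail.")
```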
Using Sidra API to push files to the landing zone¶
The Sidra API requires requests to be authenticated; the section How to use Sidra API explains how to create an authenticated request. For the rest of the document, it is assumed that the Sidra API is deployed at the following URL:
https://core-mycompany-dev-wst-api.azurewebsites.net
In order to upload a file to the landing zone, the steps below must be followed:
Step 1. Get SAS token to the Azure Storage¶
Request¶
GET https://core-mycompany-dev-wst-api.azurewebsites.net/api/datalake/landingzones/tokens?assetname=example_20190101.csv&api-version=1.0
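As a sketch, this request could be issued from Python as follows. The bearer token is obtained as described in How to use Sidra API, and the exact shape of the response body (plain token versus JSON) should be checked against the deployed API version:

```python
import requests

api_base = "https://core-mycompany-dev-wst-api.azurewebsites.net"
access_token = "<bearer token obtained as described in 'How to use Sidra API'>"

response = requests.get(
    f"{api_base}/api/datalake/landingzones/tokens",
    params={"assetname": "example_20190101.csv", "api-version": "1.0"},
    headers={"Authorization": f"Bearer {access_token}"},
)
response.raise_for_status()

# The SAS token returned by the API; adjust the parsing if the response is JSON.
sas_token = response.text.strip().strip('"')
print(sas_token)
```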
Response¶
The response will be an Azure Storage SAS token that can be used to upload the file.
Step 2. Upload the file into the Azure Storage using the SAS token provided¶
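A minimal sketch of the upload, assuming the landing zone storage account, container and blob path shown here (placeholders) and that the value returned in the previous step is a SAS token that can be appended to the blob URL:

```python
from azure.storage.blob import BlobClient

# Placeholder values: adapt to the actual landing zone account, container and blob path.
landing_blob_url = (
    "https://mycompanydevlanding.blob.core.windows.net/landing/"
    "myprovider/myentity/2019/01/01/example_20190101.csv"
)
sas_token = "<SAS token returned by the previous request>"

blob = BlobClient.from_blob_url(f"{landing_blob_url}?{sas_token}")
with open("example_20190101.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)
```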
Once the file has been uploaded to the storage, the pipelines will be automatically executed in Data Factory to ingest the data into the Data Storage Unit.
At this stage, the Data Factory instance can be checked to verify that the IngestFromLanding pipeline is running.
Method 3. Using the Sidra API to ingest Assets from an external side¶
This method skips steps 1, 2 and 3 from the generic flow. The IngestFromLanding pipeline copies the file from the landing zone to the Azure Storage of the Data Storage Unit and then requests the Sidra API to proceed with the file registration. In this third approach, all those actions are executed from an external side. In this case, a third party component (meaning software outside of the Sidra platform) copies the file to the Azure Storage and then requests the Sidra API to register the file.
Even though this option is available, it is highly recommended to use the two previous methods unless there is some important restriction.