How to add Assets to the platform via ingestion from the Landing zone

This section covers the flow that an Asset follows in Sidra: from extraction, to copying to the landing zone, to triggering its final ingestion in the Data Storage Unit.

At this point, all the metadata regarding the Asset -Provider, Entity, Attributes and Attribute formats- should already be set up. For more information about the metadata model in Sidra, see the Data ingestion metadata documentation.

Once all these steps have been taken, it is necessary to configure everything related to the data movement and orchestration.

An Asset in Sidra is a representation of each instance of the data elements that get ingested into the platform. These data elements can range from an intermediate extract from a database into a CSV format, to a PDF file which is part of a larger collection.

It is important to note that certain types of data ingestion flows extract the data directly from the data source (e.g. a SQL database) into the raw format in the Data Storage Unit, without passing through the landing zone. For other types of data flows, such as the generic batch file intake process, the landing zone file ingestion is used as the mechanism.

An overview of this generic Asset flow is summarized below:

  1. Extract data from the data source and convert into a file.

  2. Copy the file to the landing zone.

  3. Launch the ingestion process from the landing zone.

  4. File registration using Sidra API.

  5. File ingestion in the Data Storage Unit.

This page covers steps 1, 2 and 3. These steps are also described in How an Asset flows into the platform via the Landing zone. Steps 4 and 5 are explained in How Assets are ingested.

The sections below describe different methods to implement steps 1 and 2 of the above process, that is, to extract data from the source system so that the Asset can be ingested into the Data Storage Unit in Sidra.

Method 1: Use internal pipelines with triggers to copy the data to the landing zone

This method uses internal pipelines created in Azure Data Factory to move the data from the data source to the landing zone.

This process covers steps 1 and 2 of the generic flow summarized above.

When we use the term "Internal" for pipelines, this means that the pipelines are created in the Data Factories of the platform.

The internal pipelines must be created specifically for the Provider, but there is a set of accelerators to help in the implementation:

  • Linked services are created to allow the pipelines to access the Provider. A Linked Service is a representation of the connection to the data source.

In the section Connecting to new data sources through linked services, the steps to add new linked services to the platform are explained.

  • Data Factory pipelines can be defined in JSON format. Sidra stores the information of the pipelines as a set of templates and instances that are composed to build that JSON. All that information is stored in the Sidra Core metadata database. Finally, the Sidra Data Factory Manager composes the JSON and uses it to programmatically create the pipeline in Data Factory. In the section Configure new pipelines, the steps to create a pipeline based on templates are explained.

  • Data Factory includes several native activities for copying and transforming data.

If these are not enough, Python scripts with the desired behavior can be created and orchestrated through ADF Databricks activities. As a last resort, custom activities with the desired behavior can be created; however, this is not the recommended solution, as it requires Azure Batch and increases complexity and cost. The section How to create a custom activity explains how to do this.
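As an illustration of such a script, the following is a minimal sketch of step 1 of the generic flow: extracting rows from a source database and writing them to a CSV file. A local SQLite database is used purely as a stand-in for a real data source, and the table and file names are hypothetical, not part of Sidra.

```python
import csv
import sqlite3

def extract_to_csv(connection, query, output_path):
    """Run a query against the source and dump the result set to a CSV file."""
    cursor = connection.execute(query)
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor)                                 # data rows

# Stand-in source; in a real pipeline this would be e.g. a SQL Database,
# and the script would run as an ADF Databricks activity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO example VALUES (?, ?)", [(1, "a"), (2, "b")])

extract_to_csv(conn, "SELECT id, name FROM example", "example_20190101.csv")
```

The resulting file would then be copied to the landing zone in step 2.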

In addition to the pipelines, triggers must be created to launch the execution of the pipelines. The triggers can be configured on a time schedule or to respond to events in an Azure Blob Storage account.
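As a sketch, an event-based trigger that launches a pipeline when a new blob lands in a landing container could look like the following Data Factory trigger definition. The trigger name, blob path, and pipeline reference here are illustrative, not taken from a Sidra deployment:

```json
{
  "name": "LandingZoneFileArrived",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathBeginsWith": "/landing/blobs/",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ]
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "IngestFromLanding",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```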

Method 2: Asset is pushed to the landing zone

In this scenario, steps 1 and 2 of the generic flow are implemented outside the platform, and the in-platform process starts at step 3. That means that, using this method, the flow starts when the file has already been copied to the landing zone.

How the file is copied there is outside the scope of the platform. For example:

  • A file could be manually copied to the landing zone.
  • A third party component developed by the customer could be in charge of extracting the data from the source data system, and copying the extracted file into the landing zone.
  • As described below, the Sidra API could be used to push the extracted file to the landing zone.

Once the file is stored in the landing zone, a Data Factory trigger is launched, which executes the IngestFromLanding pipeline (covering step 3 above).

This method is more flexible than the previous one: Data Factory is a versatile tool, but there could be scenarios in which retrieving the data from the source system is not possible using Data Factory.

For this scenario, the only thing to take into account is to agree on the naming convention for the file. As explained before, files are registered in the system based on a naming convention that must match a regular expression (regex), which is stored in the Entity metadata. If the regex does not match, the IngestFromLanding pipeline will fail upon execution. Once the regex has been agreed, the system is ready to load files into the Data Storage Unit.
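For instance, assuming a hypothetical Entity whose naming convention is a date-stamped CSV such as example_20190101.csv, the check the platform performs can be approximated as:

```python
import re

# Hypothetical regex, as it might be stored in the Entity metadata.
ENTITY_REGEX = r"^example_\d{8}\.csv$"

def matches_naming_convention(file_name, pattern=ENTITY_REGEX):
    """Return True if the file name matches the Entity's naming convention."""
    return re.match(pattern, file_name) is not None

print(matches_naming_convention("example_20190101.csv"))  # True: would be registered
print(matches_naming_convention("example.csv"))           # False: pipeline would fail
```

Validating file names against the agreed regex before dropping them in the landing zone avoids failed pipeline runs.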

Using Sidra API to push files to the landing zone

Sidra API requires requests to be authenticated; the section How to use Sidra API explains how to create an authenticated request. For the rest of the document, it is assumed that Sidra API is deployed at the following URL:

https://core-mycompany-dev-wst-api.azurewebsites.net

In order to upload a file to the landing zone, the steps below must be followed:

Step 1. Get SAS token to the Azure Storage

Request

GET https://core-mycompany-dev-wst-api.azurewebsites.net/api/datalake/landingzones/tokens?assetname=example_20190101.csv&api-version=1.0

The response will be an Azure Storage SAS token that could be used to upload the file.

Response

[
  "SAS token"
]
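Assuming the example deployment URL above, this request can be issued with any HTTP client. The sketch below uses Python's requests library; how the bearer token is acquired is covered in How to use Sidra API and is represented here by a placeholder parameter:

```python
import requests

API_BASE = "https://core-mycompany-dev-wst-api.azurewebsites.net"

def build_sas_request(asset_name, access_token):
    """Build the authenticated GET request for a landing zone SAS token."""
    return requests.Request(
        "GET",
        f"{API_BASE}/api/datalake/landingzones/tokens",
        params={"assetname": asset_name, "api-version": "1.0"},
        headers={"Authorization": f"Bearer {access_token}"},
    ).prepare()

def get_landing_zone_sas(asset_name, access_token):
    """Send the request and return the SAS token from the JSON array response."""
    with requests.Session() as session:
        response = session.send(build_sas_request(asset_name, access_token))
        response.raise_for_status()
        return response.json()[0]  # the body is a JSON array holding the token
```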

Step 2. Upload the file into the Azure Storage using the SAS token provided
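A minimal sketch of the upload, assuming the token returned in step 1 is a container-level SAS URL (if only the SAS query string is returned, it must first be combined with the blob URL, as the helper below does). The x-ms-blob-type header is required by the Azure Blob Storage Put Blob REST operation:

```python
import requests

def blob_url(container_sas_url, asset_name):
    """Insert the blob name between the container path and the SAS query string."""
    base, query = container_sas_url.split("?", 1)
    return f"{base.rstrip('/')}/{asset_name}?{query}"

def upload_to_landing_zone(local_path, target_sas_url):
    """Upload a local file to the landing zone blob identified by the SAS URL."""
    with open(local_path, "rb") as f:
        response = requests.put(
            target_sas_url,
            data=f,
            headers={"x-ms-blob-type": "BlockBlob"},  # required for Put Blob
        )
    response.raise_for_status()
    return response.status_code  # 201 Created on success
```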

Once the file has been uploaded to the storage, the pipelines will be executed automatically in Data Factory to ingest the data into the Data Storage Unit.

At this stage, the Data Factory instance can be checked to verify that the IngestFromLanding pipeline is running.

Method 3: Using the Sidra API to ingest Assets from an external component

This method skips steps 1, 2 and 3 of the generic flow. Normally, the IngestFromLanding pipeline copies the file from the landing zone to the Azure Storage of the Data Storage Unit and then requests the Sidra API to proceed with the file registration. In this third approach, all those actions are executed externally: a third party component -meaning software outside of the Sidra platform- copies the file to the Azure Storage and then requests the Sidra API to register the file.

Although this option is available, it is highly recommended to use the two previous methods unless there is some important restriction.