How to add assets to the platform

At this point, all the metadata regarding the asset (provider, entity, attributes and attribute formats) should be set up. The next step is to configure everything related to data movement and orchestration. There are several methods to add an asset to the platform; they differ in the point of the asset flow into the platform at which they start. An overview of the flow is summarized below:

  1. Extract data from the data source and convert it into a file

  2. Copy the file to the landing zone

  3. Launch the ingestion process from the landing zone

  4. Register the file using the Sidra API

  5. Ingest the file into the Data Lake

Method 1: Internal pipelines with triggers

This method uses internal pipelines created in Azure Data Factory to move the data from the data source to the landing zone. That process covers steps 1 and 2 of the generic flow. "Internal" means that the pipelines are created in the Data Factories of the platform.

The internal pipelines must be created specifically for each provider, but there is a set of accelerators to help with the implementation:

  • Linked services are created to allow the pipelines to access the provider. The section Connecting to new data sources through linked services explains the steps to add new linked services to the platform.

  • Data Factory pipelines can be defined in JSON format. Sidra stores the information about the pipelines as a set of templates and instances, kept in the metadata database, that are composed to build that JSON. The Sidra Data Factory Manager then composes that JSON and uses it to programmatically create the pipeline in Data Factory. The section Configure new pipelines explains the steps to create a pipeline based on templates.

  • Data Factory includes several native activities for copying and transforming data. If they are not enough, custom activities with the desired behavior can be created, as sketched after this list. The section How to create a custom activity explains how to do it.
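
As an illustration of the custom activity contract, the sketch below shows a minimal Python entry script for a custom activity running on Azure Batch. Data Factory places an activity.json file (together with linkedServices.json and datasets.json) in the working directory of the Batch task; the property names read from extendedProperties are hypothetical and only show where pipeline parameters would arrive.

import json

# Data Factory drops activity.json in the working directory of the Azure
# Batch task that executes the custom activity; it contains the activity
# typeProperties defined in the pipeline, including extendedProperties.
with open("activity.json") as activity_file:
    activity = json.load(activity_file)

# Hypothetical property names, used only for illustration.
extended = activity["typeProperties"].get("extendedProperties", {})
provider_name = extended.get("providerName")
asset_date = extended.get("assetDate")

print(f"Extracting data for provider '{provider_name}' on {asset_date}")
# ...custom extraction logic producing the file for the landing zone...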

In addition to the pipelines, triggers must be created to launch their execution. The triggers can be configured on a time schedule or to respond to events in an Azure Blob Storage account.
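
As a rough sketch, the snippet below creates and starts a schedule trigger with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, trigger and pipeline names are placeholders, and a blob event trigger can be defined in a similar way; the actual trigger definitions should follow the conventions of the Sidra deployment.

from datetime import datetime, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineReference,
    ScheduleTrigger,
    ScheduleTriggerRecurrence,
    TriggerPipelineReference,
    TriggerResource,
)

# Placeholder names; use the values of the actual Sidra deployment.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Run the extraction pipeline once a day at 02:00 UTC.
trigger = TriggerResource(
    properties=ScheduleTrigger(
        recurrence=ScheduleTriggerRecurrence(
            frequency="Day",
            interval=1,
            start_time=datetime(2019, 1, 1, 2, 0, tzinfo=timezone.utc),
            time_zone="UTC",
        ),
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(reference_name="ExtractFromProvider"),
                parameters={},
            )
        ],
    )
)

adf.triggers.create_or_update(resource_group, factory_name, "DailyExtractTrigger", trigger)

# Triggers are created in a stopped state and must be started explicitly
# (begin_start in recent versions of the SDK).
adf.triggers.begin_start(resource_group, factory_name, "DailyExtractTrigger").result()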

Method 2: Asset pushed to landing zone

In this scenario, steps 1 and 2 of the generic flow are implemented outside the platform, and the process inside the platform starts at step 3. That means the flow starts once the file has already been copied to the landing zone. How the file gets there is outside the scope of the platform: it can be copied manually, a third-party component can be in charge of getting the data from the provider and copying the file into the landing zone, or, as described below, the Sidra API can be used. When the file is stored in the landing zone, it fires a Data Factory trigger that executes the IngestFromLanding pipeline.

This method is more flexible than the previous one. Data Factory is a versatile tool, but there can be scenarios in which retrieving the data from the provider is not possible using Data Factory.

For this scenario, the only point to take into account is agreeing on the naming convention for the file. As explained before, files are registered in the system based on a naming convention that must match a regular expression defined in the Entity. If the file name does not match the regex, the IngestFromLanding pipeline will fail upon execution. Once the regex has been agreed, the system is ready to load files into the Data Lake.
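
As a minimal check, the snippet below validates a file name against the regular expression agreed for the Entity before the file is dropped into the landing zone. The pattern shown is only an example; the real one is the expression stored in the Entity metadata.

import re

# Example only: the real pattern is the one defined in the Entity metadata.
entity_regex = r"^example_\d{8}\.csv$"
file_name = "example_20190101.csv"

if re.match(entity_regex, file_name):
    print("File name matches the Entity convention; it can be copied to the landing zone")
else:
    raise ValueError(f"'{file_name}' does not match '{entity_regex}'; IngestFromLanding would fail")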

Using Sidra API to push files to the landing zone

Sidra API requires requests to be authenticated; the section How to use Sidra API explains how to create an authenticated request. For the rest of the document, it is assumed that Sidra API is deployed at the following URL:

https://core-mycompany-dev-wst-api.azurewebsites.net

In order to upload a file to the landing zone, the steps below must be followed:

1. Get a SAS token for the Azure Storage

Request

GET https://core-mycompany-dev-wst-api.azurewebsites.net/api/datalake/landingzones/tokens?assetname=example_20190101.csv&api-version=1.0

The response is an Azure Storage SAS token that can be used to upload the file.

Response

[
  "SAS token"
]
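
For example, a minimal Python sketch of this request, assuming a bearer access token obtained as described in How to use Sidra API, could look like this:

import requests

API_BASE = "https://core-mycompany-dev-wst-api.azurewebsites.net"
ACCESS_TOKEN = "<access token obtained as described in How to use Sidra API>"

response = requests.get(
    f"{API_BASE}/api/datalake/landingzones/tokens",
    params={"assetname": "example_20190101.csv", "api-version": "1.0"},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
response.raise_for_status()

# The response body is a JSON array whose first element is the SAS token.
sas_token = response.json()[0]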

2. Upload the file into the Azure Storage using the SAS token provided
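
Continuing the previous sketch, the file can be uploaded with a Put Blob request against the Azure Blob Storage REST API. This assumes the value returned in step 1 is a complete SAS URI for the destination blob; if only the token (the query string) is returned, it must be appended to the landing zone blob URL. For large files, the azure-storage-blob SDK, which handles chunked uploads, may be a better option.

import requests

# SAS URI returned by the previous request (assumed to be a complete URI
# pointing at the destination blob in the landing zone).
sas_token = "<SAS URI from step 1>"

with open("example_20190101.csv", "rb") as data:
    upload = requests.put(
        sas_token,
        data=data,
        headers={"x-ms-blob-type": "BlockBlob"},  # required by the Put Blob operation
    )
upload.raise_for_status()
print("Upload finished; the IngestFromLanding pipeline should start shortly")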

Once the file has been uploaded to the storage, the pipelines are automatically executed in Data Factory to ingest the data into the Data Lake. At this stage, the Data Factory instance can be checked to verify that the IngestFromLanding pipeline is running.

Method 3: Using the API from an external component

This method skips steps 1, 2 and 3 of the generic flow. Normally, the IngestFromLanding pipeline copies the file from the landing zone to the Azure Storage of the Data Lake and then requests the Sidra API to proceed with the file registration. In this third approach, all those actions are performed externally: a third-party component (meaning software outside the Sidra platform) copies the file to the Azure Storage and then requests the Sidra API to register the file.

Although this option is available, it is recommended to use the previous methods unless some restriction prevents their use.