IngestFromLanding pipeline

The IngestFromLanding pipeline implements steps 3 and 4 of "How an asset flows into the platform": it registers the file and launches the ingestion process.

Definition

The pipeline uses the AutoGeneratedLandingLoader pipeline template:

{
    "name": "##name##",
    "description": "Generated pipeline for ##name##",
    "properties": {
        "parameters": {
            "folderPath": {
                "type": "String"
            },
            "fileName": {
                "type": "String"
            }
        },
        "activities": [##Activities##]
    }
}
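As an illustration of how such a template could be filled in, here is a minimal sketch that assumes the ##name## and ##Activities## placeholders are replaced by plain text substitution. The render_pipeline function and the activity dictionaries are hypothetical; the actual generation mechanism used by the platform may differ.

```python
import json

# The template shown above, with the parameters written compactly.
TEMPLATE = """{
    "name": "##name##",
    "description": "Generated pipeline for ##name##",
    "properties": {
        "parameters": {
            "folderPath": {"type": "String"},
            "fileName": {"type": "String"}
        },
        "activities": [##Activities##]
    }
}"""


def render_pipeline(template: str, name: str, activities: list[dict]) -> dict:
    """Fill the ##name## and ##Activities## placeholders and parse the result.

    Hypothetical helper: plain text substitution is an assumption.
    """
    # json.dumps escapes quotes and backslashes; [1:-1] drops the outer quotes
    # because the template already supplies them.
    rendered = template.replace("##name##", json.dumps(name)[1:-1])
    # Activities are serialized one by one and joined, because the template
    # already supplies the surrounding square brackets.
    activities_json = ", ".join(json.dumps(a) for a in activities)
    rendered = rendered.replace("##Activities##", activities_json)
    return json.loads(rendered)


pipeline = render_pipeline(
    TEMPLATE,
    "ingest-from-landing",
    # Placeholder activity bodies; real activity templates carry more fields.
    [{"name": "ImportFiles"}, {"name": "DeleteFiles"}],
)
print(json.dumps(pipeline, indent=4))
```

Parsing the rendered text with json.loads is a cheap way to confirm that the substituted template is still valid JSON before it is deployed.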

The AutoGeneratedLandingLoader is associated with these dataset templates:

  • LandingZoneDataset
  • LandingZoneFileDataset

And with these activity templates in this specific order:

  1. ImportFiles
  2. DeleteFiles

[Figure: pipeline-ingest-from-landing]

How does it work

Pipeline launch

The Core Storage Blob Created trigger configured in the Data Factory detects a new file in the landing zone and executes ingest-from-landing. The trigger fills in the two pipeline parameters, folderPath and fileName, with the information of the newly detected file.
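The parameter mapping can be pictured as splitting the path of the new blob into its folder and its file name. The sketch below is only an illustration: the trigger_parameters function and the example path are hypothetical, and in Azure Data Factory the storage event trigger typically supplies these values itself (for example through @triggerBody().folderPath and @triggerBody().fileName).

```python
import posixpath


def trigger_parameters(blob_path: str) -> dict:
    """Split a blob path into the two pipeline parameters.

    Hypothetical helper: it mimics what the trigger provides, it is not
    part of the platform.
    """
    # Blob paths use forward slashes regardless of the operating system,
    # so posixpath is used instead of os.path.
    folder_path, file_name = posixpath.split(blob_path.strip("/"))
    return {"folderPath": folder_path, "fileName": file_name}


# Hypothetical path of a file dropped in the landing zone.
params = trigger_parameters("landing/myprovider/2023/01/sales_20230101.csv")
print(params)
```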

ImportFiles activity

The ImportFiles custom activity uses the LandingZoneFileDataset to take the file from the landing zone and then requests the Sidra API to register the file in the system. Registering the file consists of:

  1. Identifying which provider the file belongs to. To do so, the file name is matched against the naming convention of each entity. Once the entity is identified, the provider follows directly, since every entity is related to a single provider.

  2. Checking that the landing zone where the file resides corresponds to the appropriate Data Lake. In step 1, the file has been associated with an entity. Every entity is associated with a Data Lake, and every Data Lake has a landing zone. The system checks that these all match.

  3. Copying the file from the landing zone to the raw storage associated with the Data Lake, where it will remain as a raw copy of the file.

  4. Inserting the information (name, date, entity, path to the landing...) of the file into the File table.

  5. Running the fileIngestion-databricks pipeline.

Optionally, it also registers the parts of the file and copies them to the same location where the file is stored.
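Steps 1 and 2 of the registration can be sketched as follows. This is a simplified illustration and not the Sidra API implementation: the Entity class, the sample entities, the LANDING_ZONES mapping and the use of regular expressions as naming conventions are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical, reduced view of an entity."""
    name: str
    provider: str          # every entity is related to a single provider
    data_lake: str         # every entity is associated with one Data Lake
    filename_pattern: str  # naming convention, modeled here as a regex


# Hypothetical mapping from each Data Lake to its landing zone.
LANDING_ZONES = {"lake-a": "landing-a"}

# Hypothetical entities and naming conventions.
ENTITIES = [
    Entity("sales", "erp", "lake-a", r"^sales_\d{8}\.csv$"),
    Entity("customers", "crm", "lake-a", r"^customers_\d{8}\.csv$"),
]


def find_entity(file_name: str, entities: list[Entity]) -> Optional[Entity]:
    """Step 1: return the first entity whose naming convention matches."""
    for entity in entities:
        if re.match(entity.filename_pattern, file_name):
            return entity
    return None


def landing_zone_matches(entity: Entity, landing_zone: str) -> bool:
    """Step 2: check the file arrived in the landing zone of the entity's Data Lake."""
    return LANDING_ZONES.get(entity.data_lake) == landing_zone


entity = find_entity("sales_20230101.csv", ENTITIES)
if entity is not None and landing_zone_matches(entity, "landing-a"):
    print(f"provider={entity.provider}, data lake={entity.data_lake}")
```

A file whose name matches no naming convention yields None in this sketch, which is the point where a real registration would be expected to fail.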

DeleteFiles activity

The DeleteFiles custom activity uses the LandingZoneDataset to access the landing zone and the fileName pipeline parameter to identify the file, which it then deletes from the landing zone.