Data Factory metadata

The movement of data in Sidra platform is performed by Azure Data Factory (ADF). Every object in ADF is defined using a JSON structure that can be used along with the API and SDKs provided by ADF to deploy pipelines programmatically.

The Sidra platform takes advantage of this capability and provides a metadata system for storing those JSON structures and a component to automatically deploy those pipelines in ADF. The metadata system is composed by a set of tables in the metadata database that are generically called Data Factory metadata. Data Factory Manager is the component that uses the Data Factory metadata and programmatically deploys the ADF objects defined in it.

This system automates the deploying of pipelines so they can be easily setup in any environment -for example deploy to a new test environment- at any time given. The idea is not only to reproduce the environment configuration in Azure Data Factory, but also being able to add new pipelines by adding information to the metadata. The Data Factory metadata is organized using templates for each of the ADF components -activities, datasets, triggers, pipelines- which allows easing the creation of new pipelines by reusing the ones already created.

Those ADF component templates are the JSON structures that defines the component but containing placeholders. When the placeholders are resolved with actual values, the resulting JSON structure can be used to create the component in ADF.

ADF component templates

Sidra provides a set of ADF components templates that are used to create the workflows to ingest the data from the sources into the platform -specifically into Core- and also to move data from Core into the client apps. As commented before, the templates are the same JSON structures that define an ADF component but containing placeholders. The placeholders are identified by beginning and ending with ##. When the placeholders are resolved with actual values, the resulting JSON structure is the definition of a component that can be used to programmatically create the component in ADF. For example, these are a couple of templates used for defining a trigger:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
{
    "name": "##name##",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "##basePath##",
            "scope": "##scope##",
            "events": [
                "Microsoft.Storage.BlobCreated"
            ]
        },
        "pipelines": [
            ##Pipelines##
        ]
    }
}

it contains the following placeholders:

  • ##name##: it will be resolved with the name of the trigger, e.g. "Core Storage Blob Created"
  • ##basePath##: it will be resolved with the path of the blob where the trigger will be checking if there are new files, e.g. "/landing/blobs/"
  • ##scope##: it will be resolved to the reference to the Azure Storage account, e.g. "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/Dev.NorthEurope/providers/Microsoft.Storage/storageAccounts/xxxxx"
  • ##Pipelines##: it will be resolved with the pipelines that are executed by this trigger. Those pipelines are obtained by the resolution of the following Pipeline reference template, which at the same time is using the ##pipelineName## placeholder
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
{
    "pipelineReference": {
        "type": "PipelineReference",
        "referenceName": "##pipelineName##"
    },
    "parameters": {
        "folderPath": "@triggerBody().folderPath",
        "fileName": "@triggerBody().fileName"
    }
}

Every ADF component that defines a template, also defines a set of default values for the placeholders of the template. Although, it can happen that there is no default value for a specific placeholder.

The values used to resolve the placeholders are contained in the metadata database. So by using ADF component templates and their resolving values, Sidra can automatically recreate -create, update, delete- all the data workflows. All this work -resolving the placeholders and updating the Data Factories- is implemented by the Data Factory Manager.

The way to extend the Sidra platform with new workflows is by adding more information to the metadata database that defines new pipelines, activities, triggers, etc.

DataFactory table

This table contains the information of the Data Factories configured in the platform.

Column Description
Id DataFactory identifier
Name Name of the Azure Data Factory resource
ResourceGroup Name of the Azure Resource Group where the Data Factory resides

ADF Triggers

Sidra platform stores the information about triggers in the tables:

TriggerTemplate table

This table stores the information about a trigger template.

Column Description
Id TriggerTemplate identifier
ItemId The global identifier of the template. It is unique within all the trigger templates in all the Sidra installations
Name Name of the TriggerTemplate, e.g. "EventTrigger"
Description Description of the TriggerTemplate, e.g. "Detects when a file has been created into Blob Storage and raise configured pipelines"
Template JSON template defining the trigger. As a template, it contains placeholders
PipelineTemplate JSON template defining the pipelines to which the trigger will apply
DefaultValue Values to resolve the template placeholders by default

Trigger table

This table stores the information about a trigger. It is related with a TriggerTemplate and it contains the parameters to resolve the placeholders used in it.

Column Description
Id Trigger identifier
ItemId The global identifier of the trigger. It is unique within all the triggers in all the Sidra installations
IdTemplate Identifier of the TriggerTemplate
Name Name of the Trigger, e.g. "Core Storage Blob Created"
Description Description of the Trigger, e.g. "Trigger to extract daily content from external data sources"
Parameters Values used to resolve the TriggerTemplate placeholders
IdDataFactory Identifier of the Data Factory where the trigger will be created
IsRemoved Identifies if the trigger should be created or not
LastUpdated Date of the last time that the trigger was updated
LasDeployed Date of the last time that the trigger was created in the Data Factory

TriggerPipeline table

This table stores the relationship between Triggers and Pipelines, so it identifies which Pipelines are triggered by which Triggers.

Column Description
IdTrigger Identifier of the Trigger
IdPipeline Identifier of the Pipeline

How it works

Sidra defines triggers based on the triggers templates. Each trigger:

  • is related to the Data Factory where the trigger will be created
  • is related to the Pipelines that the Trigger executes
  • provides some parameters that will be resolved with the placeholders of the trigger template

Those parameters are defined in a JSON format with this semantic:

  • the properties are the name of the placeholders to be replaced without the outer ##
  • the values are the value used to replace the placeholder

This is an example of parameters in a trigger:

1
2
3
4
{
    "basePath": "##landingZone/1/BasePath##",
    "scope": "##landingZone/1/StorageResourceId##"
}

So, in the trigger template associated to this trigger:

  • the placeholder ##basePath## will be replaced by ##landingZone/1/BasePath## and
  • the placeholder ##scope## will be replaced by ##landingZone/1/StorageResourceId##

As it can be seen in the previous example, the resolution has introduced two new placeholders in the trigger template. That is ok since the trigger template goes through several rounds of resolution of placeholders.

The resolution of placeholders can be divided in two parts, (1) generate the pipeline references and (2) generate the trigger JSON file used to create the trigger in Data Factory.

Pipeline reference generation

It uses the pipelines associated to the trigger and the Pipeline reference template to obtain a JSON file with all the pipeline references. Then it replaces the placeholder ##Pipelines## in the trigger template with the JSON file.

Trigger generation

It involves going through the following rounds. Every round resolves the Parameter source with the trigger template:

Round Parameters source
1 Trigger parameters. This is the one that has been explained above. It takes the parameters included in Trigger and replace them in the TriggerTemplate.
2 Default values from the TriggerTemplate. Also commented previously. All templates -including TriggerTemplates- have defined a DefaultValue for the placeholders defined in the template. If some of the placeholders have not been replaced in the round 1 then they will be replaced with the default values of the template.
3 Global placeholders. It resolves some global placeholders that can be used in different types of templates, e.g. the landingZone placeholders -like landingZone/1/BasePath- and the dates placeholders -like DateTime.Today.Time-.
4 Configuration placeholders with their corresponding values. They are stored -both placeholder and value to resolve- in the [Management].[Configuration] table in the Core database.

After all these rounds, the resulting JSON file is used to create the trigger in the Data Factory.

ADF Dataset

A dataset is a "named view of data that simply points or references the data you want to use in your activities as inputs and outputs". Data Factory defines several types of datasets depending on the source of the data, e.g. Azure Storage Blob, Azure SQL Database, Hive, etc.

Besides the type, datasets can define parameters. It is important to differentiate between the placeholders defined in the dataset template and the parameters of the dataset.

  • The placeholders in the template are meant to be a mechanism created by Sidra to reuse the information of ADF components and accelerate the creation of new data workflows, they are recognizable by the syntax ##placeholder##.

  • The parameters are a feature of ADF to reuse the same dataset in different context by using different values in the parameters, they are defined as part of the JSON configuration.

For example, the following dataset template contains a placeholder -##linkedService##- and two parameters -folderPath and fileName-.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
{
    "name": "LandingZoneFileDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "##linkedService##",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "@dataset().folderPath",
            "fileName": "@dataset().fileName",
            "format": {
                "type": "TextFormat",
                "rowDelimiter": "\n"
            }
        },
        "parameters": {
            "folderPath": {
                "type": "String",
                "defaultValue": ""
            },
            "fileName": {
                "type": "String",
                "defaultValue": ""
            }
        }
    }
}

DatasetTemplate table

This table stores the information about a dataset template that can be used in any pipeline template.

Column Description
Id DatasetTemplate identifier. It is unique within all the dataset templates in the same Sidra installation
ItemId The global identifier of the template. It is unique within all the dataset templates in all the Sidra installations
Name Name of the DatasetTemplate, e.g. "LandingZoneDataset"
Description A description to tell what the dataset template does, e.g. "Selects a folder in the Blob."
Template A JSON structure defining the dataset. As it is a template, it can contain placeholders
DefaultValue A JSON structure that contains key-value pairs. The keys are placeholder names and the value is the default value to replace placeholder in case those values are not specified by other elements associated to the dataset template.

Sidra does not store datasets by themselves the same way it does with triggers. It only stores dataset templates. The generation of the JSON configurations to create the datasets in Data Factory is implemented as part of the generation of the pipelines.

ADF Activities

Every activity template can refer to a native task supported in Data Factory -native activity- or to a custom task specifically created for a particular situation -custom activity-. Sidra provides a set of custom activities that covers all the tasks required by the pipelines templates that are not natively supported. Additional custom activities can be implemented and integrated in Sidra if necessary.

Activities, as datasets, can define parameters. The set of parameters of each activity can be consulted in Azure Data Factory documentation -for native activities- or in the documentation of the custom activity.

ActivityTemplate table

This table is used to define activity templates that represents an ADF activity but containing some placeholders that will be substituted in order to generate a valid JSON structure that defines an activity in ADF. The activity templates are associated to pipeline templates and can be reused by several of them.

Column Description
Id ActivityTemplate identifier. It is unique within all the activity templates in the same Sidra installation
ItemId The global identifier of the template. It is unique within all the activity templates in all the Sidra installations
Name Name of the ActivityTemplate, e.g. "Lookup"
Description A description to tell what the ActivityTemplate does, e.g. "Executes a lookup activity"
Template A JSON structure defining the activity. As it is a template, it can contain placeholders
DefaultValue A JSON structure that contains key-value pairs. The keys are placeholder names and the value is the default value to replace placeholder in case those values are not specified by other elements associated to the activity template.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
{
    "name": "##lookupBlobActivityName##",
    "type": "Lookup",
    "dependsOn": [##dependsOn##],
    "policy": {
        "timeout": "03:00:00",
        "retry": 0,
        "retryIntervalInSeconds": 30,
        "secureOutput": false
    },
    "typeProperties": {
        "source": {
            "type": "BlobSource",
            "recursive": true
        },
        "dataset": {
            "referenceName": "##lookupBlobDatasetName##",
            "parameters": {             
                "fileName": "##lookupBlobFileName##",
                "folderPath": "##lookupBlobFolderPath##"
            },
            "type": "DatasetReference"
        },
        "firstRowOnly": ##firstRowOnly##
    }
}

The same way it happens to datasets, Sidra does not store activities by themselves but only activity templates. The generation of the JSON configurations to create the activities in Data Factory is implemented as part of the generation of the pipelines.

ADF Pipelines

Pipelines, same as datasets and activities, can define parameters, so in each execution of the pipeline can be triggered with a different set of values for those parameters. The pipelines define a list of activities to execute sequentially.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
{
    "name": "PipelineName",
    "description": "Pipeline description",
    "properties": {
        "parameters": {
            "parameterName01": {
                "type": "DateTime"
            },
            "parameterName02": {
                "type": "String"
            },
            "parameterName03": {
                "type": "Int"
            },
            ...
        },
        "activities": [
            {
                "name": "Activity01",
                "type": "Lookup"
                ...
            },
            {
                "name": "Activity02",
                "type": "SqlServerStoredProcedure"
                ...
            },
            ...
        ]
    }
}

The pipeline templates take advantage of the templates defined previously (dataset and activity templates) to configure the pipeline. So, the definition of a pipeline template contains:

  • the template of the pipeline
  • the default values for the template of the pipeline
  • a list of dataset templates along with the parameters used to replace the placeholders of each dataset template
  • an ordered list of activity templates along with the parameters used to replace the placeholders of each activity template

The following tables are used to represent the above information.

PipelineTemplate table

This table is used to define pipeline templates that can be used by any pipeline.

Column Description
Id The identifier of the template
ItemId The global identifier of the template. It is unique within all the pipeline templates in all the Sidra installations
Name Name of the template, e.g. "AutoGeneratedImportFile"
Description A description to tell what the template does
Template JSON structure defining the pipeline. As it is a template, it can contain placeholders
DefaultValue A JSON structure that contains key-value pairs. The keys are placeholder names and the value is the default value to replace placeholder in case those values are not specified by other elements associated to the pipeline template.
IdPipelineTemplatePlatform Identifier of the platform intended for the pipeline template
IdPipelineType Identifier the type of pipeline

The PipelineTemplate table can store templates for Azure Data Factory pipelines but also for Azure Search pipelines, so the field IdPipelineTemplatePlatform indicates which is the intended platform in which the pipeline template can be used.

The pipeline templates are categorized by types based on their intended functionality, so the Data Factory Manager and other Sidra components can recognise the purpose of the pipeline. The field IdPipelineType selects the category -or type- of pipeline template.

PipelineTemplatePlatform table

This table is used to store the list of available platforms for which pipeline templates can be created.

Column Description
Id Pipeline template platform identifier
Name Name of the platform where the pipeline template will be used

These are the available platforms:

Id Name
1 AzureDataFactory
2 AzureSearch

PipelineTemplateType table

This table is used to store the different categories or types of pipeline templates based on their functionality.

Column Description
Id Pipeline template type identifier
Name Name of the type -or category- of the pipeline template

These are the available types of pipeline templates:

Id Name
1 Ingestion
2 System
3 DW
4 Log
5 Other
6 LoadRegisteredAsset
7 AzureSearchIndexer

DatasetTemplatePipelineTemplate table

This table links dataset templates with pipeline templates. It is used to build the structure of a pipeline template, telling what dataset the activities inside the pipeline are going to use and parameters if necessary.

Column Description
Id Identifier of the link
ItemId The global identifier of the link. It is unique within all the link in all the Sidra installations
IdPipelineTemplate The identifier of the pipeline template used
IdDatasetTemplate The identifier of the dataset template used in the pipeline template
Parameters Parameters to replace in the dataset template that are used across all pipelines that use this pipeline template

ActivityTemplatePipelineTemplate table

This table links activity templates with pipeline templates. It is used to build the structure of a pipeline template, telling what activities are going to use, in which order and which parameters if necessary.

Column Description
Id Identifier of the link
ItemId The global identifier of the link. It is unique within all the links in all the Sidra installations
IdPipelineTemplate The identifier of the pipeline template used
IdActivityTemplate The identifier of the activity template used in the pipeline template
Order The position of the activity inside the pipeline
Parameters Parameters to replace in the activity template that are used across all pipelines that use this pipeline template

EntityPipeline table

This table associates entities with pipelines. In case the template of the pipeline has placeholders referencing to an entity metadata, this table will be used to get the entity associated and access to the information used to replace those placeholders.

Column Description
IdPipeline The pipeline identifier
IdEntity The entity identifier

Pipeline

This table defines actual pipelines that use the pipelines templates built and that are going to be deployed in the Azure Data Factory service.

Column Description
Id Identifier of the pipeline
ItemId The global identifier of the pipeline. It is unique within all the pipelines in all the Sidra installations
Name Name of the pipeline
ValidFrom Indicates the date the pipeline starts working
ValidUntil Indicates the date the pipeline should not continue working
IdTemplate Specifies what template is used by this pipeline
LastUpdated Tells the last date the pipeline has been updated. If its value is greater than LastDeployed, the pipeline will be deployed to Azure Data Factory next time the Data Factory Manager is executed
LastDeployed Tells the last date the pipeline has been deployed
IdDataFactory Tells what Data Factory this pipeline must be deployed on
IsRemoved Tells if the Data Factory Manager should remove this pipeline from Azure Data Factory
Parameters A JSON structure that contains key-value pairs. The keys are placeholder names and the value is the default value to replace placeholder

Sidra defines pipelines based on the pipelines templates. Each pipeline is related to a Data Factory where the pipeline will be created and also provides some parameters that will be resolved with the placeholders of the pipeline template. The parameters follow the same structure that the ones for triggers. They are a JSON structure with pairs of: placeholder name, value used for replacement.

Same as triggers, the resolution of parameters follows a list of resolution rounds since it has to replace the placeholders for:

  • the pipeline template associated to the pipeline
  • all the dataset templates associated to the pipeline template
  • all the activity templates associated to the pipeline template

The process of resolution of parameters can be divided in two parts, (1) generate the datasets JSON files and create them in Data Factory and (2) generate the pipeline JSON files -which includes the activities- and create both of them in Data Factory.

Dataset generation

This is the order of resolution of parameters in the dataset template associated to the pipeline template of the pipeline.

Round Parameters source
1 From the association between the pipeline template and the dataset template
2 From the pipeline
3 Default values included in the dataset template
4 Default values included in the pipeline template
5 Global placeholders that can be used in different types of templates
6 Configuration placeholders with their corresponding values. They are stored in the [Management].[Configuration] table in the Core database.
7 Entity placeholders. In case of there is a unique entity associated to the pipeline, a list of entity placeholders will be replaced by the respective values in that entity.

This is the list of entity placeholders:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
- ##Entity.TableName##
- ##Entity.Provider.Id##
- ##Entity.Provider##
- ##Entity.Provider.DatabaseName##
- ##Entity.Id##
- ##Entity.ProviderWithoutDashes##
- ##Entity.DataDelay##
- ##Entity.SourcePath##
- ##Entity.RowDelimiter##
- ##Entity.NullText##
- ##Entity.FieldDelimiter##

After all these resolutions, the resulting JSON file is used to create the dataset in the Data Factory.

Pipeline and activities generation

For each activity template associated to the pipeline template of the pipeline, this is the order of resolution of parameters:

Round Parameters source
1 From the association between the pipeline template and the activity template
2 From the pipeline
3 Default values included in the activity template

After all these resolutions have been carried out, the resulting JSON structure is included in the pipeline template and then these resolutions are carried out:

Round Parameters source
1 From the pipeline. This second time that the parameters from pipeline are resolved is because in the previous one they are only applied to dataset templates, now it includes the pipeline template.
2 Default values included in the pipeline template
3 Global placeholders that can be used in different types of templates
4 Configuration placeholders with their corresponding values. They are stored in the [Management].[Configuration] table in the Core database.
5 entity placeholders. In case of there is a unique entity, a list of entity placeholders will be replaced by the respective values in that entity.

After all these resolutions, the resulting JSON structure is used to create the pipeline -which include the activities- in the Data Factory.

Triggers association

Pipelines can also be associated to several -or none- triggers. The implementation of this relationship in Data Factory is carried out by the trigger creation table. For this reason, the Data Factory Manager creates first the Pipelines and then the Triggers that will need the pipelines already created to implement the relationship.