How to configure a new pipeline based on templates

Before configuring a new pipeline, all the metadata regarding the asset must be set up. Once that setup is complete, the metadata database can be populated with the information that describes the Azure Data Factory pipelines.

The purpose of this section is to explain how to store the information that defines a pipeline in the database using the templates provided by Sidra. The tables used in this process are located in the DataIngestion schema of the metadata database, and their definitions and descriptions can be found in Data Storage Units tables and Data Factory tables.

These are the steps to configure a new pipeline based on templates:

1. Review the provided PipelineTemplates

Sidra includes a set of pre-configured pipeline templates for the most common scenarios. Currently they are:

  • Files are pulled from an SQL Server
  • Files are pulled from an Azure SQL Database

If one of those scenarios applies, steps 3 to 5 can be skipped.

If a different scenario is required, a new pipeline must be designed to meet the requirements. Once it is designed, a new PipelineTemplate must be added to the metadata database and used to create the pipeline. In that case, all the steps below must be followed.
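
To review which templates are already registered, the metadata database can be queried directly. The snippet below is a minimal sketch: it assumes a [DataIngestion].[PipelineTemplate] table exposing [Id], [ItemId] and [Name] columns (the table and one of its ItemId values appear in the sample scenario at the end of this section); check the Data Factory tables documentation for the exact schema.

-- List the pipeline templates already available in the metadata database.
-- The column list is indicative; see the Data Factory tables documentation for the full schema.
SELECT [Id], [ItemId], [Name]
FROM [DataIngestion].[PipelineTemplate]
ORDER BY [Name]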

2. Configure the Linked Service to access the data source

To allow Data Factory to access the data in the data source, it is necessary to create and configure a Linked Service.

The first step is to check the access to the data source. The data source could be inside a virtual private network, in which case Data Factory will not be able to reach it directly and it might be necessary to install an Integration Runtime.

The second step is to configure the source accordingly by creating a new Linked Service to pull data from. How to add new linked services is described in Connecting to new data sources through linked services.

3. Create or configure a DatasetTemplate that uses the Linked Service

The pipelines in Data Factory use linked services through a Dataset. The Dataset represents a view of the data in the data source. Sidra provides a set of parametrized DatasetTemplates; by configuring the parameters, a Dataset can be created from a DatasetTemplate.

If none of the existing DatasetTemplates fits the specific situation, a new DatasetTemplate must be created with the necessary parameters (so it can be reused) and then configured, that is, its parameters set, so it can be used by the pipeline.
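
As an illustration, a new DatasetTemplate could be registered with an INSERT similar to the sketch below. The table and column names ([Id], [Name], [Template]) are assumptions made by analogy with the TriggerTemplate table used in the sample scenario, and the ##placeholder## markers follow the same convention as the provided templates; verify the actual schema in the Data Factory tables documentation before using it.

-- Illustrative sketch only: table and column names are assumed, not confirmed.
SET IDENTITY_INSERT [DataIngestion].[DatasetTemplate] ON
INSERT [DataIngestion].[DatasetTemplate]
    ([Id], [Name], [Template])
VALUES
    (10, N'AzureSqlTableDataset', N'{
        "name": "##name##",
        "properties": {
            "type": "AzureSqlTable",
            "linkedServiceName": {
                "referenceName": "##linkedServiceName##",
                "type": "LinkedServiceReference"
            },
            "typeProperties": {
                "tableName": "##tableName##"
            }
        }
    }')
SET IDENTITY_INSERT [DataIngestion].[DatasetTemplate] OFF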

4. Create or configure the ActivityTemplates that are used by the pipeline

A pipeline is basically a set of chained activities. Sidra provides a set of ActivityTemplates that define the most common Data Factory activities. The templates include parameters so the ActivityTemplates can be reused.

If the pipeline uses an activity that can be generated from an existing ActivityTemplate, only the parameters need to be configured.

If there is no ActivityTemplate for an activity used by the pipeline, a new one must be created. The new ActivityTemplate must include parameters so it can be reused.
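
A new ActivityTemplate could be added following a similar pattern to the sketch below. Again, the table and column names are assumptions made by analogy with the other template tables; the template JSON simply shows a parametrized Data Factory Copy activity.

-- Illustrative sketch only: table and column names are assumed, not confirmed.
SET IDENTITY_INSERT [DataIngestion].[ActivityTemplate] ON
INSERT [DataIngestion].[ActivityTemplate]
    ([Id], [Name], [Template])
VALUES
    (15, N'Copy from SQL table', N'{
        "name": "##activityName##",
        "type": "Copy",
        "inputs": [ { "referenceName": "##inputDataset##", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "##outputDataset##", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "AzureSqlSource" },
            "sink": { "type": "ParquetSink" }
        }
    }')
SET IDENTITY_INSERT [DataIngestion].[ActivityTemplate] OFF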

5. Create or configure a PipelineTemplate

Once all the elements that the pipeline uses (datasets and activities) are defined by their corresponding templates, the new PipelineTemplate can be created and configured using the previously mentioned templates.
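
As a sketch, a custom PipelineTemplate could be registered as shown below. The column list and the ##Activities## placeholder are assumptions (by analogy with the ##Pipelines## placeholder in the TriggerTemplate of the sample scenario); the templates shipped with Sidra, such as "SqlServerGenerated", are the best reference for the real structure.

-- Illustrative sketch only: table and column names are assumed, not confirmed.
SET IDENTITY_INSERT [DataIngestion].[PipelineTemplate] ON
INSERT [DataIngestion].[PipelineTemplate]
    ([Id], [Name], [Template])
VALUES
    (7, N'CustomExtractPipeline', N'{
        "name": "##name##",
        "properties": {
            "parameters": {
                "fileDate": { "type": "String" }
            },
            "activities": [
                ##Activities##
            ]
        }
    }')
SET IDENTITY_INSERT [DataIngestion].[PipelineTemplate] OFF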

6. Create a pipeline based on a PipelineTemplate

Once there is a PipelineTemplate that complies with all the requirements (either because it has been created specifically or because one of the templates provided by Sidra is being reused), a pipeline must be configured using that PipelineTemplate and the appropriate parameters. The sample scenario at the end of this section shows a concrete INSERT into the Pipeline table.

7. Design how the pipeline will be executed

With the pipeline configured, it is time to consider what kind of trigger will be used to execute it. Follow the instructions provided in the sample scenario below to add new triggers.

8. Deploy new elements in Data Factory

Go to the Azure portal and execute the Azure Data Factory Manager to generate the pipelines, datasets and triggers.

Sample scenario using the 'Extract from Azure SQL Database' pipeline template

To start ingesting data from an Azure SQL Database, the following steps must be followed:

  • Step 1. Create a new Linked Service to pull data from by following Connecting to new data sources through linked services
  • Step 2. Insert a new pipeline in the Pipeline table using the PipelineTemplate with ItemId 9B7DE610-B5DD-427D-9DDC-E1A44C495E71. This template comes from the Sidra seed, so it is not expected to change in the near future. In any case, the name of the PipelineTemplate is "SqlServerGenerated". Example:
SET IDENTITY_INSERT [DataIngestion].[Pipeline] ON
INSERT [DataIngestion].[Pipeline] 
    ([Id], [Name], [ValidFrom], [ValidUntil], [IsAutogenerated], [IdTemplate], [LastUpdated], [LastDeployed], [IdDataFactory], [IsRemoved], [Parameters])
VALUES 
    (20, N'load-from-sql', CAST(N'2019-01-09' AS Date), NULL, 0, 5, GETUTCDATE(), NULL, 1, 0, N'{}')
SET IDENTITY_INSERT [DataIngestion].[Pipeline] OFF
  • Step 3. Add the relationship between the Pipeline and the Entity by adding a record to the EntityPipeline table. The dataingestion-parametrized pipeline must be assigned to the Entity as well:
INSERT [DataIngestion].[EntityPipeline] 
     ([IdPipeline], [IdEntity])
VALUES
     (1, 2)
    ,(20, 2)
  • Step 4. Since SQL databases usually do not push data, it is necessary to configure a trigger to pull the data from the database. How frequently the trigger fires within a day may vary depending on the requirements. In any case, it is necessary to add a new trigger template, a trigger, and the association between the trigger and the pipeline. Example:
SET IDENTITY_INSERT [DataIngestion].[TriggerTemplate] ON 
INSERT [DataIngestion].[TriggerTemplate] 
    ([Id], [Name], [Description], [Template], [PipelineTemplate], [DefaultValue]) 
VALUES 
    (2, N'Hourly execution', N'Execute the provided pipelines every hour', N'{
           "name": "##name##",
              "properties": {

                "type": "ScheduleTrigger",
                "typeProperties": {
                  "recurrence": {
                    "frequency": "Hour",
                    "interval":1,
                    "startTime": "##startTime##",
                    "endTime": "##endTime##"
                }},
               "pipelines": [
                        ##Pipelines##
                  ]
            }
            }', N'{
                "pipelineReference": {
                    "type": "PipelineReference",
                    "referenceName": "##pipelineName##"
                },
                 "parameters": {
                    "fileDate": {
                        "type": "Expression",
                        "value": "@trigger().scheduledTime"
                    }
                }
            }', N'{"startTime":"2018-03-01T00:00:00-00:00", "endTime":"2099-01-09T00:00:00-00:00"}')
SET IDENTITY_INSERT [DataIngestion].[TriggerTemplate] OFF

SET IDENTITY_INSERT [DataIngestion].[Trigger] ON 
INSERT [DataIngestion].[Trigger] 
    ([Id], [IdTemplate], [Name], [Description], [Parameters], [IdDataFactory], [IsRemoved], [LastUpdated], [LastDeployed]) 
VALUES 
    (2, 2, N'Hourly daily execution', N'Trigger to extract hourly content from external data sources', N'{"startTime":"2018-07-03T00:15:00-00:00", "endTime":"2099-01-09T00:00:00-00:00"}', 1, 0, GETUTCDATE(), NULL)
SET IDENTITY_INSERT [DataIngestion].[Trigger] OFF

INSERT [DataIngestion].[TriggerPipeline] 
    ([IdTrigger], [IdPipeline]) 
VALUES 
    (2, 20)
  • Step 5. Go to the Azure portal and execute the Azure Data Factory Manager to generate the pipelines, datasets and triggers.