Skip to content

Sidra Data Platform orchestration process

Data movements in Sidra Data Platform can be performed in many ways. In order to ease the development efforts, Sidra provides a standard way to perform data movements using Azure Data Factory V2 (ADF). Azure Data Factory was chosen for a number of reasons:

  • Platform as a Service

    ADF works completely on the cloud as an PaaS service, so there is no need of any virtual (or real) machine to execute an ADF process, as would be the case of some wide used technologies like SSIS. Removing the need of a dedicated machine reduces the complexity of deploy and the maintenance effort required, and improves the scalability.

  • No Licensing Costs

    There is no licensing cost associated, like most third party products. Costs are based on usage only. As part of the Azure infrastructure, it benefits for any existing Enterprise Agreement between the client (whose solution will use Sidra) and Microsoft.

  • Open architecture

    ADF provides many built-in features, growing on a regular basis, that covers the most common scenarios. However, it can be expanded with custom actions if needed. Also, ADF supports the execution of .NET code to cover any needs not implemented as a built-in feature.

  • Monitor and manage out of the box

    The Data Factory service provides a monitoring dashboard with a complete view of storage, processing, and data movement services. System health and issues can be tracked and corrective actions taken using this dashboard.

  • Works across on-premises and the cloud

    Data Factory works across on-premises and cloud data sources.

  • Fully integrated in Azure infrastructure (ARM)

    ADF integrates with Azure Resource Manager. This means Azure management and deployment capabilities are leveraged when using Data Factory.

What is Data Factory

Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. In ADF, we can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.

How does it work

In Azure Data Factory the pipelines (data-driven workflows) typically perform the following four steps:


  • Connect & Collect

    In most of the cases the enterprises have data of various types that is located in disparate sources on-premises, in the cloud, structured, unstructured, and semi-structured, all arriving at different intervals and speeds. In order to create the centralized information system for subsequent processing, all these sources need to connect and provide the data. Azure Data Factory, centralize all these connectors, avoiding in this way to build custom data movement components or write custom services to integrate these data sources and processing, which are expensive and hard to integrate and maintain such systems.

  • Transform & Enrich

    Once the data is centralized, using computed services it is processed and transformed.

  • Publish

    After the raw data has been refined into a business-ready consumable form, the data is loaded and publish in order to be able to be consumed by client applications.

  • Monitor

    Once the data is published it can me monitored in order to detect anomalies, improve processing and observe production.

ADF components

Azure Data Factory is composed of four key components that work together to provide to compose data-driven workflows with steps to move and transform data. Summarizing, these are the basic components used to create the data workflows in ADF:

  • Pipelines
  • Activities
  • Triggers
  • Connections: Linked Services and Integration Runtime
  • Datasets

The ADF portal is a website that allows to manage all the components of ADF in a visual and easy way.

Moreover, every component can be defined by using a JSON file, that can be used to programmatically manage ADF. The complete definition of a data workflow can be composed by a set of JSON files. For example, this is the definition -or configuration- of a trigger that runs a pipeline when a new file is stored in an Azure Storage Blob account:

    "name": "Core Storage Blob Created",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/landing/blobs/",
            "scope": "/subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/Core.Dev.NorthEurope/providers/Microsoft.Storage/storageAccounts/corestadev",
            "events": [
        "pipelines": [
                "pipelineReference": {
                    "referenceName": "ingest-from-landing",
                    "type": "PipelineReference"
                "parameters": {
                    "folderPath": "@triggerBody().folderPath",
                    "fileName": "@triggerBody().fileName"


Data Factory pipelines (do not mistake with the Microsoft DevOps service called Azure Pipelines) are composed by independent activities that perform one task each. When those activities are chained, it is possible to make different transformations and move the data from the source to the destination. For example, a pipeline can contain a group of activities that ingest data and then transform it by applying a query. ADF can execute multiple pipelines at same time.

The management of ADF pipelines is automatized by a key architectural piece, the Data Factory Manager. This accelerator enables programmatic management of ADF pipelines, and greatly streamlines any changes that need to happen due to new data sources or changes in the existing ones avoiding human/manual interaction.


Activities represent a processing step in a pipeline. For example, copy data from one data store to another data store. There are several tasks natively supported in Data Factory, like the copy activity, which just copies data from a source to a destination. However, there are some processes that are not supported natively in Data Factory. For custom behaviours ADF provides a type of activity named Custom Activity which allows the user to implement specific domain business logic. Custom Activities can be coded in any programming language supported by Microsoft Azure Virtual Machines (Windows or Linux).


Pipelines can be executed manually -by means of the ADF portal, PowerShell, .NET SDK, etc.- or can be automatically executed by a Trigger. Triggers, same as pipelines, are configured in ADF and they are one of its basic components.

Connections: Linked Services and Integration Runtime

To reference the systems where the data is located (such as an SFTP server, or a SQL Server database), there are objects called Linked Services. They basically behave as if they were connection strings. Sometimes it is necessary to access some data store which is not reachable from the Internet. In this cases, it is necessary to use what is known as Integration Runtime, which a software that allows Data Factory to communicate with the data store using an environment as a gateway. In the ADF portal, linked services and integration runtime are grouped in a category called Connections.

Integration Runtimes

Integration Runtimes are, basically, sets of (virtual) machines that execute Data Factory pipeline activities. Post deployment, in certain scenarios, Sidra may need custom or on-premises Integration Runtimes (IRs). These will require a Java Runtime Environment on their nodes, as per Sidra's Databricks needs.

Sidra is making use of Azure's Data Factory to ingest the data, to copy it from sources in the Databricks. Data is ingested by executing Data Factory pipelines, composed of activities. The activities are being processed in Integration Runtimes.

Azure Data Factory comes with a default Azure-based Microsoft-managed AutoResolveIntegrationRuntime. Sidra, upon deployment, also adds a default Integration Runtime, although no nodes are initially added there.

There are certain setups - like those blocking firewall traversal - that would prevent activities from running in the Data Factory's default AutoResolveIntegrationRuntime. In such situations, custom or on-premises nodes would need to be added: to the Sidra's default Integration Runtime or any other additional one.

Adding Integration Runtime nodes

An Integration Runtime is, basically, a set nodes for activity execution. To understand how it works, we only ask "How is Data Factory pushing activities on the Integration Runtime nodes?"

First, an agent gets installed on the machine that would become a node. Then, we "tell" that agent "how" to connect to the Data Factory to receive workload, activities to execute. Basically, we configure the agent with an endpoint and a security key: through that endpoint, the agent connects to the Data Factory to receive workload. This connectivity info is found in the Integration Runtime's key(s).

Change the Integration Runtime

Once a node was added to an Integration Runtime, the UI of the agent on the node does not allow re-assigning the node to a different Integration Runtime set.
But the agent does come with a PowerShell script allowing to re-register the node:
C:\Program Files\Microsoft Integration Runtime\5.0\PowerShellScript\RegisterIntegrationRuntime.ps1

Find in the product documentation the details about creating an Integration Runtime set and adding nodes to it.


Integration Runtimes using OpenJDK

More information about integration runtimes and OpenJDK implementation can be checked in this tutorial .


To reference data inside a system (for example, a file in an SFTP server or a table in a SQL Server database), there are objects called Datasets. The activities can reference zero or more datasets (it depends what type of activity is used).

Data Factory Manager

The management of all the ADF components is automatized by a key architectural piece in Sidra, the Data Factory Manager. This accelerator enables programmatic management of ADF components, and greatly streamlines any changes that need to happen due to new data sources or changes in the existing ones.

In order to achieve that, it uses the information stored in the Core database about the ADF components that are predefined in the Sidra platform. Those "predefined" components are templates that contain placeholders. When the placeholders are resolved with actual values, the resulting component can be used in ADF.

Sidra platform provides a set of predefined ADF components that are used to create the workflows to ingest the data from the sources into the platform and also to move data from Core into the client apps.

Sidra Ideas Portal

Last update: 2022-09-05
Back to top