Sidra Data Platform orchestration process¶
Data movements in Sidra Data Platform can be performed in many ways. To ease development, Sidra provides a standard way to perform data movements using Azure Data Factory V2 (ADF). Azure Data Factory was chosen for a number of reasons:
Platform as a Service
ADF runs entirely in the cloud as a PaaS service, so no virtual (or physical) machine is needed to execute an ADF process, as would be the case with some widely used technologies like SSIS. Removing the need for a dedicated machine reduces deployment complexity and maintenance effort, and improves scalability.
No Licensing Costs
Unlike most third-party products, there is no associated licensing cost. Costs are based on usage only. As part of the Azure infrastructure, ADF benefits from any existing Enterprise Agreement between the client (whose solution will use Sidra) and Microsoft.
ADF provides many built-in features, growing on a regular basis, that cover the most common scenarios. However, it can be extended with custom actions if needed. ADF also supports the execution of .NET code to cover any needs not implemented as a built-in feature.
Monitor and manage out of the box
The Data Factory service provides a monitoring dashboard with a complete view of storage, processing, and data movement services. System health and issues can be tracked and corrective actions taken using this dashboard.
Works across on-premises and the cloud
Data Factory works across on-premises and cloud data sources.
Fully integrated in Azure infrastructure (ARM)
ADF integrates with Azure Resource Manager. This means Azure management and deployment capabilities are leveraged when using Data Factory.
What is Data Factory¶
Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation. In ADF, we can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.
How does it work¶
In Azure Data Factory, pipelines (data-driven workflows) typically perform the following four steps:
Connect & Collect
In most cases, enterprises have data of various types (structured, unstructured, and semi-structured) located in disparate sources, on-premises and in the cloud, all arriving at different intervals and speeds. To build a centralized information system for subsequent processing, all these sources need to be connected so that they can provide their data. Azure Data Factory centralizes these connectors, which avoids building custom data movement components or writing custom services to integrate these data sources and their processing. Such custom systems are expensive and hard to integrate and maintain.
Transform & Enrich
Once the data is centralized, it is processed and transformed using compute services.
Publish
After the raw data has been refined into a business-ready, consumable form, the data is loaded and published so that it can be consumed by client applications.
Monitor
Once the data is published, it can be monitored in order to detect anomalies, improve processing and observe production.
Azure Data Factory is composed of four key components that work together to compose data-driven workflows with steps to move and transform data. In summary, these are the basic components used to create data workflows in ADF:
- Connections: Linked Services and Integration Runtime
The ADF portal is a website for managing all the components of ADF in a visual and easy way.
Every component can also be defined in a JSON file, which makes it possible to manage ADF programmatically. The complete definition of a data workflow can be composed of a set of JSON files. For example, a trigger that runs a pipeline when a new file is stored in an Azure Blob Storage account can be defined (or configured) this way.
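As a minimal sketch of what such a trigger definition might look like, the snippet below builds it as a Python dictionary and serializes it to JSON. The layout follows the ADF `BlobEventsTrigger` format; the trigger name, pipeline name, path filters and pipeline parameters are hypothetical, and the storage account scope is left with unresolved `<...>` markers.

```python
import json

# Hypothetical names: NewFileTrigger, IngestFilePipeline and the "landing"
# container are illustrative only and not taken from Sidra.
trigger = {
    "name": "NewFileTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            # Only blobs under this path and with this suffix fire the trigger.
            "blobPathBeginsWith": "/landing/blobs/incoming/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": True,
            "events": ["Microsoft.Storage.BlobCreated"],
            # Resource ID of the storage account being watched.
            "scope": (
                "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
                "/providers/Microsoft.Storage/storageAccounts/<storage-account>"
            ),
        },
        # Pipelines started by the trigger, with values taken from the event.
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "IngestFilePipeline",
                    "type": "PipelineReference",
                },
                "parameters": {
                    "folderPath": "@triggerBody().folderPath",
                    "fileName": "@triggerBody().fileName",
                },
            }
        ],
    },
}

trigger_json = json.dumps(trigger, indent=2)
print(trigger_json)
```

The printed JSON is the kind of file that could be stored alongside the other component definitions and deployed to ADF.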
Data Factory pipelines (not to be confused with the Microsoft DevOps service called Azure Pipelines) are composed of independent activities that perform one task each. When those activities are chained, the data can be transformed and moved in several steps from the source to the destination. For example, a pipeline can contain a group of activities that ingest data and then transform it by applying a query. ADF can execute multiple pipelines at the same time.
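The chaining of activities can be sketched as follows. In a pipeline definition, each activity lists the activities it depends on in `dependsOn`; the example below builds a two-activity pipeline (ingest, then transform) and derives a valid execution order from those dependencies. The pipeline and activity names and the helper functions are hypothetical, and `typeProperties` are omitted, so this is an illustration of the structure rather than a deployable pipeline.

```python
import json


def activity(name, activity_type, depends_on=()):
    """Build a minimal activity entry (typeProperties omitted for brevity)."""
    return {
        "name": name,
        "type": activity_type,
        "dependsOn": [
            {"activity": dep, "dependencyConditions": ["Succeeded"]}
            for dep in depends_on
        ],
    }


# Hypothetical pipeline: copy data in, then run a query over it.
pipeline = {
    "name": "IngestAndTransform",
    "properties": {
        "activities": [
            activity("CopyFromSource", "Copy"),
            activity("TransformWithQuery", "Script", depends_on=["CopyFromSource"]),
        ]
    },
}


def execution_order(pipeline_def):
    """Return activity names so that each appears after its dependencies."""
    pending = {
        a["name"]: {d["activity"] for d in a["dependsOn"]}
        for a in pipeline_def["properties"]["activities"]
    }
    ordered = []
    while pending:
        done = set(ordered)
        ready = sorted(name for name, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("unresolvable dependency between activities")
        ordered.extend(ready)
        for name in ready:
            del pending[name]
    return ordered


print(json.dumps(pipeline, indent=2))
print(execution_order(pipeline))
```

Activities without a dependency between them have no required order, which is why the helper only guarantees that dependencies come first.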
The management of ADF pipelines is automated by a key architectural piece, the Data Factory Manager. This accelerator enables programmatic management of ADF pipelines and greatly streamlines any changes required by new data sources or by changes in existing ones, avoiding manual intervention.
An activity represents a processing step in a pipeline, for example copying data from one data store to another. Several tasks are natively supported in Data Factory, like the copy activity, which simply copies data from a source to a destination. However, some processes are not supported natively. For custom behaviors, ADF provides a type of activity named Custom Activity, which allows the user to implement domain-specific business logic. Custom Activities can be coded in any programming language supported by Microsoft Azure Virtual Machines (Windows or Linux).
Pipelines can be executed manually (by means of the ADF portal, PowerShell, the .NET SDK, etc.) or automatically by a Trigger. Triggers, like pipelines, are configured in ADF and are one of its basic components.
Connections: Linked Services and Integration Runtime¶
To reference the systems where the data is located (such as an SFTP server or a SQL Server database), there are objects called Linked Services. They basically behave like connection strings. Sometimes it is necessary to access a data store that is not reachable from the Internet. In these cases, it is necessary to use what is known as an Integration Runtime, which is software that allows Data Factory to communicate with the data store by using an environment as a gateway. In the ADF portal, linked services and integration runtimes are grouped in a category called Connections.
To reference data inside a system (for example, a file in an SFTP server or a table in a SQL Server database), there are objects called Datasets. An activity can reference zero or more datasets, depending on the type of activity.
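The relationship between these objects can be sketched as follows: a linked service describes how to reach a system (optionally through an integration runtime), and a dataset points to data inside that system by referencing the linked service by name. All names, the connection string markers and the lookup helper below are hypothetical; only the overall JSON layout follows the ADF format.

```python
import json

# Linked service: behaves like a connection string to a SQL Server database.
# "connectVia" routes the connection through an integration runtime, which is
# needed when the server is not reachable from the Internet.
linked_service = {
    "name": "OnPremSqlServer",
    "properties": {
        "type": "SqlServer",
        "typeProperties": {
            "connectionString": "Server=<server>;Database=<database>;Integrated Security=True;"
        },
        "connectVia": {
            "referenceName": "SelfHostedRuntime",
            "type": "IntegrationRuntimeReference",
        },
    },
}

# Dataset: a table inside the system described by the linked service.
dataset = {
    "name": "CustomerTable",
    "properties": {
        "type": "SqlServerTable",
        "linkedServiceName": {
            "referenceName": linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"schema": "dbo", "table": "Customer"},
    },
}


def resolve_linked_service(dataset_def, linked_services):
    """Look up the linked service definition that a dataset refers to."""
    name = dataset_def["properties"]["linkedServiceName"]["referenceName"]
    return linked_services[name]


catalog = {linked_service["name"]: linked_service}
print(json.dumps(resolve_linked_service(dataset, catalog), indent=2))
```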
Data Factory Manager¶
The management of all the ADF components is automated by a key architectural piece in Sidra, the Data Factory Manager. This accelerator enables programmatic management of ADF components and greatly streamlines any changes required by new data sources or by changes in existing ones.
In order to achieve that, it uses the information stored in the Core database about the ADF components that are predefined in the Sidra platform. Those "predefined" components are templates that contain placeholders. When the placeholders are resolved with actual values, the resulting component can be used in ADF.
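The idea of resolving placeholders in a template can be illustrated with a short sketch. It assumes a hypothetical `<<Name>>` placeholder syntax and a hypothetical trigger template; the actual placeholder format and the templates stored in the Core database are not shown here and may differ.

```python
import json
import re

# Hypothetical placeholder syntax: <<Name>>.
PLACEHOLDER = re.compile(r"<<(\w+)>>")


def resolve(template, values):
    """Recursively replace placeholders in a JSON-like structure.

    A placeholder with no matching entry in `values` raises KeyError, so an
    incomplete component is never produced silently.
    """
    if isinstance(template, dict):
        return {key: resolve(item, values) for key, item in template.items()}
    if isinstance(template, list):
        return [resolve(item, values) for item in template]
    if isinstance(template, str):
        return PLACEHOLDER.sub(lambda m: str(values[m.group(1)]), template)
    return template


# Hypothetical template for a trigger component.
trigger_template = {
    "name": "<<SourceName>>Trigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/<<Container>>/blobs/<<SourceName>>/",
            "ignoreEmptyBlobs": True,
        },
    },
}

resolved = resolve(trigger_template, {"SourceName": "Sales", "Container": "landing"})
print(json.dumps(resolved, indent=2))
```

Keeping the template as structured data rather than raw text means the resolved component is still valid JSON after substitution.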
The Sidra platform provides a set of predefined ADF components that are used to create the workflows that ingest data from the sources into the platform and move data from Core into the client apps.