ADF Dataset
A dataset is a "named view of data that simply points or references the data you want to use in your activities as inputs and outputs". Data Factory defines several types of datasets depending on the source of the data, e.g. Azure Storage Blob, Azure SQL Database, Hive, etc.
Besides the type, datasets can define parameters. It is important to differentiate between the placeholders defined in the dataset template and the parameters of the dataset.
- Placeholders are a mechanism created by Sidra to reuse the information of ADF components and speed up the creation of new data workflows. They are recognizable by the syntax ##placeholder##.
- Parameters are an ADF feature that allows the same dataset to be reused in different contexts by supplying different parameter values. They are defined as part of the JSON configuration.
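The difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sidra's implementation: the `fill_placeholders` helper and the `MyBlobStorage` linked service name are hypothetical. It shows that replacing ##placeholder## tokens is a plain text substitution done before the dataset exists, while `@dataset()` parameter expressions pass through untouched for ADF to evaluate later.

```python
import json
import re

# Matches Sidra-style placeholders such as ##linkedService##.
PLACEHOLDER = re.compile(r"##(\w+)##")


def fill_placeholders(template: str, values: dict) -> str:
    """Replace every ##name## in the template text with values[name] (hypothetical helper)."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return values[name]
    return PLACEHOLDER.sub(substitute, template)


# A cut-down template: one placeholder and one ADF parameter.
template = """
{
  "linkedServiceName": {
    "referenceName": "##linkedService##",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {"fileName": "@dataset().fileName"},
  "parameters": {"fileName": {"type": "String"}}
}
"""

# "MyBlobStorage" is an assumed linked service name, used only for illustration.
resolved = json.loads(fill_placeholders(template, {"linkedService": "MyBlobStorage"}))
print(resolved["linkedServiceName"]["referenceName"])  # placeholder was replaced
print(resolved["typeProperties"]["fileName"])          # parameter expression is unchanged
```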
Example
The following dataset template contains one placeholder (##linkedService##) and two parameters (folderPath and fileName).

    {
      "name": "LandingZoneFileDataset",
      "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
          "referenceName": "##linkedService##",
          "type": "LinkedServiceReference"
        },
        "typeProperties": {
          "folderPath": "@dataset().folderPath",
          "fileName": "@dataset().fileName",
          "format": {
            "type": "TextFormat",
            "rowDelimiter": "\n"
          }
        },
        "parameters": {
          "folderPath": {
            "type": "String",
            "defaultValue": ""
          },
          "fileName": {
            "type": "String",
            "defaultValue": ""
          }
        }
      }
    }

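To show how the two parameters would receive values, the sketch below builds the kind of dataset reference an ADF activity uses for its inputs or outputs. The `dataset_reference` helper and the folder and file values are hypothetical. The fragment shape (`referenceName`, `type`, `parameters`) follows the ADF dataset reference format.

```python
import json


def dataset_reference(dataset_name: str, **parameters: str) -> dict:
    """Build the JSON fragment an activity uses to reference a parameterized dataset."""
    return {
        "referenceName": dataset_name,
        "type": "DatasetReference",
        "parameters": parameters,
    }


reference = dataset_reference(
    "LandingZoneFileDataset",
    folderPath="landing/2024/01",  # hypothetical folder
    fileName="customers.csv",      # hypothetical file
)
print(json.dumps(reference, indent=2))
```

With a reference like this, ADF would evaluate `@dataset().folderPath` and `@dataset().fileName` in the dataset to the supplied values, so the same dataset can serve many files.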
DatasetTemplate table
This table stores the information about a dataset template that can be used in any pipeline template.
Column | Description | Format | Required |
---|---|---|---|
Id | DatasetTemplate identifier. It is unique among all the dataset templates in the same Sidra installation | int | Yes |
ItemId | The global identifier of the template. It is unique among all the dataset templates across all Sidra installations | uniqueidentifier | Yes |
Name | Name of the DatasetTemplate, e.g. "LandingZoneDataset" | varchar(80) | Yes |
Description | A description of what the dataset template does, e.g. "Selects a folder in the Blob." | nvarchar(512) | |
Template | A JSON structure defining the dataset. As it is a template, it can contain placeholders | nvarchar(max) | Yes |
DefaultValue | A JSON structure that contains key-value pairs. Each key is a placeholder name and each value is the default used to replace that placeholder when no value is specified by other elements associated with the dataset template. | nvarchar(max) | Yes |
Unlike triggers, datasets themselves are not stored by Sidra; only dataset templates are. The JSON configurations used to create the datasets in Data Factory are generated as part of the generation of the pipelines.
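As an illustration of how the Template and DefaultValue columns could work together during generation, the sketch below fills placeholders from explicitly supplied values and falls back to the defaults otherwise. How Sidra actually performs this merge is not described here, so the `resolve_template` function, the column contents and the linked service names are all assumptions.

```python
import json
import re


def resolve_template(template: str, default_value: str, overrides: dict) -> dict:
    """Fill ##placeholders##, preferring overrides and falling back to the defaults."""
    # Later keys win, so explicit overrides replace the stored defaults.
    values = {**json.loads(default_value), **overrides}
    filled = re.sub(r"##(\w+)##", lambda m: str(values[m.group(1)]), template)
    return json.loads(filled)


# Hypothetical contents of the Template and DefaultValue columns.
template = (
    '{"linkedServiceName": {"referenceName": "##linkedService##", '
    '"type": "LinkedServiceReference"}}'
)
default_value = '{"linkedService": "DefaultBlobStorage"}'

with_default = resolve_template(template, default_value, {})
with_override = resolve_template(
    template, default_value, {"linkedService": "ProjectBlobStorage"}
)
print(with_default["linkedServiceName"]["referenceName"])   # falls back to the default
print(with_override["linkedServiceName"]["referenceName"])  # uses the explicit value
```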