Sidra Client pipelines

The Pipeline table in Client apps has some additional fields compared to its counterpart in Core, which are described in this documentation.

PipelineSyncBehaviour

PipelineSyncBehaviour. The behaviour that will be applied from the Sync webjob perspective. The Sync webjob is responsible for synchronizing metadata and executing the pipelines associated with the entities; when a pipeline is executed, the Sync webjob provides several parameters to it. Depending on the behaviour, the pipelines will be:

  • Ignored: The pipeline will be ignored by the Sync webjob, so it should probably be executed using a Trigger.
  • LoadAllNewDates: All Assets with completed dates will be included to be extracted by the Sync webjob execution of the pipeline.
  • LoadFirstNewDate: Only the oldest completed set of Asset data will be included to be extracted by the Sync webjob execution of the pipeline.
  • LoadLastNewDate: Only the newest completed set of Asset data will be included to be extracted by the Sync webjob execution of the pipeline.

  • ExecutionParameters. Used to include values for the parameters of the pipeline when it is executed by the Sync webjob.
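
For illustration, ExecutionParameters is a JSON object whose keys match parameter names defined in the pipeline. A minimal sketch, with a purely hypothetical parameter name, could be:

{
    "myCustomParameter": "some value"
}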

Sync pipeline requirements

All the pipelines used by the Sync webjob (all of them except those with behaviour Ignored) require using a PipelineTemplate with the following parameters. The value of the parameters will be populated automatically by the Sync webjob:

  • storageName: The name of the storage account used to drop the content. This storage account is created by the deployment.
  • storageToken: The SAS container token of the storage account used.
  • containerName: The name of the container where the content will be dropped.
  • queryParameters: A set of populated parameters, which include the assets that are going to be extracted per entity from the Data Storage Unit.
  • overwrite: A parameter -True by default in the template; it can be overridden in the ExecutionParameters of your pipeline- to indicate whether the content will be appended or cleaned and dropped.

From the point of view of the PipelineTemplate, an example of the parameters section of the JSON, with the minimum required parameters, would be the following:

 "parameters": {
    "storageName": {
        "type": "string"
    },
    "storageToken": {
        "type": "string"
    },
    "containerName": {
        "type": "string"
    },
    "queryParameters": {
        "type": "array"
    },
    "overwrite": {
        "type": "string",
        "defaultValue": "true"
    }
}

Besides these parameters, and depending on the PipelineTemplate, additional parameters can be defined for the execution.
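
As a sketch, an additional parameter -the name below is hypothetical- would simply be declared next to the required ones in the parameters section of the PipelineTemplate, and its value could then be provided through the ExecutionParameters:

    "myAdditionalParameter": {
        "type": "string",
        "defaultValue": ""
    }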

Complete set of data

When an Entity is defined, there is a field named FilesPerDrop (further info in the metadata description) which indicates how many files are going to be loaded for the same date. For the Entities associated with the pipeline with the Mandatory flag, at least that minimum number of Assets must be available. When this condition is met for the same AssetDate across all the entities associated with the pipeline, it is said that there is a complete set of data. For example, if a pipeline is associated with two mandatory Entities with FilesPerDrop values of 2 and 1, an AssetDate is complete once two Assets for the first Entity and one Asset for the second are available for that date. The granularity of the complete set of data is based on AssetDate; based on that, the Sync webjob will take care of the Assets to load in the client application.

EntityPipeline information

The same extraction pipeline can be used to extract several assets from different entities. That can be configured by associating the same pipeline to several entities using the EntityPipeline table.

| Column | Description |
| --- | --- |
| IdPipeline | [Required] Identifier of the pipeline. |
| IdEntity | [Required] Identifier of the related entity. |
| IsMandatory | [Required] If true, the assets associated with the IdEntity need to form a complete set to run the pipeline. If false, the complete set calculation will not take this IdEntity into account. By default, TRUE. |
| PipelineExecutionProperties | [Optional] A place to put a JSON that can be used in the pipeline for different purposes. By default, NULL, empty or "{}". |
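
For illustration, PipelineExecutionProperties could hold a small JSON object such as the following; the keys shown are hypothetical and are only meaningful if the pipeline activities actually read them:

{
    "destinationSchema": "staging",
    "truncateBeforeLoad": "true"
}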

Extraction pipelines execution

The extraction pipelines will be launched by the Sync webjob once it has checked the following conditions:

  1. The PipelineSyncBehaviour is not Ignored. The metadata database stores all the pipelines that will be created in the Client app ADF; some of them will be extraction pipelines, but others can be created for other tasks.
  2. The pipeline is not associated -through an entity- with a provider that is disabled.
  3. The pipeline is not already being executed.
  4. There are assets ready to be exported, i.e. the asset status is MovedToDataLake or ImportingFromDataLake.
  5. If the relation between the pipeline and the entity IsMandatory and the entity is valid -the dates of the assets to be exported are between the StartValidDate and EndValidDate of the entity- then there are at least as many assets ready to be exported as the entity's FilesPerDrop.

Available pipeline templates

There are several PipelineTemplates provided by Sidra which can be used to accelerate the data movement between the DSU and the client application. These pipelines are provided as-is and only for the purpose they are defined for. The platform is open to creating your own pipelines with your own activities, the same as in Core, but here your pipeline should have the parameters explained in the previous points so that it can be raised by the Sync webjob.

Current pipeline templates available are:

| ItemId | Purpose |
| --- | --- |
| BF367329-ABE2-42CD-BF4C-8461596B961C | Extract assets to the storage account and then to the staging tables in the database, with support to scale up and scale down the database. |
| 19C95A0E-3909-4299-AEE1-15604819E2B0 | Extract assets to the storage account and then to the staging tables in the database. |
| F5170307-D08D-4F92-A9C9-92B30B9B3FF1 | Extract assets to the storage account and execute a notebook. |

SQL Staging Table generation

By default, for all the pipeline templates that drop the content into a SQL database, the destination staging table is auto-created by the pipeline itself through the execution of a stored procedure provided by Sidra. Sidra will create the table with the same attributes defined in the Entity, in the same order and with the same types, except for SQL_VARIANT, VARBINARY, GEOGRAPHY and UNIQUEIDENTIFIER: the first three will be converted to NVARCHAR(MAX) and UNIQUEIDENTIFIER to VARCHAR(16), in order to guarantee the compatibility between the parquet files and the ADF Copy activity.

As the tables are recreated in every load, they don't include keys or indexes. These changes can be done inside the orchestrator stored procedure, which should ensure that the tables are optimized for the queries that will be executed.

Default pipeline template for extraction with database scale

There is a default pipeline with the ItemId BF367329-ABE2-42CD-BF4C-8461596B961C that can be used to extract content. This pipeline will drop the content to the client application storage account and also to a table in the SQL database created in the client application. In addition, the pipeline will scale up and scale down the database to provide the required capacity in order to optimize the data movement and stored procedure executions. The template includes the parameters described before to work with the Sync webjob, and it defines some mandatory parameters that must be populated in the ExecutionParameters:

  • storedProcedureName: The stored procedure that is going to be invoked after the content of all the entities has been dropped.
  • scaleUpCapacity: The scale-up capacity for the database.
  • scaleUpTierName: The scale-up tier name for the database. For this parameter and scaleUpCapacity, please check the Database Editions List.

The pipeline invokes the Sidra API to perform the extraction of the content. The content will be dropped in the storage account in parquet format.

For example, the following ExecutionParameters section:

{
    "storedProcedureName": "[schema].[StoredProcedure]",
    "scaleUpCapacity":"10", 
    "scaleUpTierName":"S0" 
}

Default pipeline template for extraction

There is a default pipeline with the ItemId 19C95A0E-3909-4299-AEE1-15604819E2B0 that can be used to extract content. This pipeline will drop the content to the client application storage account and also to a table in the SQL database created in the client application. The template includes the parameters described before to work with the Sync webjob, and it defines a mandatory parameter that must be populated in the ExecutionParameters:

  • storedProcedureName: The stored procedure that is going to be invoked after the content of all the entities has been dropped.

The pipeline invokes the Sidra API to perform the extraction of the content. The content will be dropped in the storage account in parquet format.

For example, the following ExecutionParameters section:

{
    "storedProcedureName": "[schema].[StoredProcedure]"
}

Pipeline template for a Notebook execution

This is the pipeline with ItemId F5170307-D08D-4F92-A9C9-92B30B9B3FF1 that can be used to extract content and execute a Databricks Notebook once the content is stored in the storage account. It can be used with the DataLab Client App template.

The parameters required for using this pipeline are:

  • orchestratorNotebookPath: The path of the Notebook to execute. The Notebook should be previously uploaded into the Databricks instance.
  • entityName: A parameter to provide the name of the entity (or empty)
  • pipelineName: A parameter to provide the name of the pipeline which is being executed (or empty)

Both entityName and pipelineName are parameters that are going to be provided to the Notebook; depending on the Notebook, they may or may not be useful.

For example, the ExecutionParameters section will be:

{
    "orchestratorNotebookPath": "/Shared/MyNotebook",
    "entityName": "myentity",
    "pipelineName": "mypipeline"
}