Sidra Client pipelines

Further concepts on Client Application pipelines

The Pipeline table in the Client Application metadata database has some additional fields compared to the metadata tables in Core, as described in the Concepts section.

PipelineSyncBehaviour table

PipelineSyncBehaviour. This table defines the different types of behaviour that can be executed from the Sync webjob perspective. The Sync webjob is responsible for synchronizing metadata and executing the pipelines associated with the Entities. The Pipeline table contains a field pointing to this behaviour.

When a pipeline is executed, the Sync webjob will provide several parameters to it. Depending on the sync behaviour, the pipelines will be handled as follows:

  • Ignored: The pipeline is ignored by the Sync webjob, so it will typically need to be executed explicitly using a Trigger.
  • LoadAllNewDates: All Assets of completed dates (see the complete set of data section below) will be included for extraction when the Sync webjob executes the pipeline.
  • LoadFirstNewDate: Only the oldest completed set of Asset data will be included for extraction when the Sync webjob executes the pipeline.
  • LoadLastNewDate: Only the newest completed set of Asset data will be included for extraction when the Sync webjob executes the pipeline.

ExecutionParameters field

  • ExecutionParameters. This is a field in the Pipeline table in Client Applications. It is used to include values for the parameters of the pipeline when the pipeline is executed by the Sync webjob.
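
As a minimal sketch, assuming a pipeline based on a template that expects a storedProcedureName parameter (such as the default extraction templates described below), the ExecutionParameters field could contain a JSON object like the following, where the stored procedure name is only a placeholder:

{
    "storedProcedureName": "[schema].[StoredProcedure]"
}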

Sync pipeline requirements

All the pipelines called by the Sync webjob (that is, all except those with behaviour Ignored) require a PipelineTemplate with some required parameters, in addition to the ExecutionParameters.

The values of these pipeline parameters will be populated automatically by the Sync webjob, and you can see them in the Template field of the PipelineTemplate table:

  • storageName: The storage name used to drop the content. This storage account is created by the deployment of the Client Application.
  • storageToken: The SAS container token of the storage used.
  • containerName: The storage container name where the content will be dropped.
  • queryParameters: Some parameters are populated here by the Sync webjob, which include the Assets that are going to be extracted per Entity from the Data Storage Unit.
  • overwrite: By default this value is True. This parameter can be overridden in the ExecutionParameters field of the pipeline to indicate whether the content will be appended, or cleaned and dropped (see the sketch after this list).
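
As a minimal sketch, assuming the content should be appended instead of cleaned and dropped, the default value of overwrite could be overridden through the ExecutionParameters field as follows (the remaining parameters are still populated automatically by the Sync webjob):

{
    "overwrite": "false"
}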

From the point of view of the PipelineTemplate, an example of the parameters section of the JSON (Template field) is the following (minimum required):

 "parameters": {
    "storageName": {
        "type": "string"
    },
    "storageToken": {
        "type": "string"
    },
    "containerName": {
        "type": "string"
    },
    "queryParameters": {
        "type": "array"
    },
    "overwrite": {
        "type": "string",
        "defaultValue": "true"
    }
}

There may be more parameters, depending on the PipelineTemplate.

Complete set of data

When an Entity is defined, there is a field named FilesPerDrop (further information in the metadata description), which refers to how many files are going to be loaded for the same date for that given Entity.

For example, let's suppose that for a given Entity, every drop of data in Sidra Core requires at least three different files to be dropped to the landing zone and ingested into Sidra.

This means that, for the data to be considered complete, there should be at least the minimum number of Assets available for the selected Entities: three Assets, following the example.

In summary, the conditions that must be fulfilled for a complete set of data, so that a pipeline can run, are the following:

  • The Entities are associated with the pipeline (through the Entity-pipeline association, see below on EntityPipeline information).
  • The EntityPipeline relationship has been tagged as Mandatory. See the mandatory Assets section for a detailed description of the Mandatory Assets feature.
  • The AssetDate field of each Asset, for all the Entities associated with the pipeline, is between the StartValidDate and EndValidDate of the Entity.

The granularity of the complete set of data is based on the field AssetDate and depends on the Sync webjob behaviour (see above in this page).

EntityPipeline information

The same extraction pipeline can be used to extract several Assets from different Entities. That can be configured by associating the same pipeline with several Entities using the EntityPipeline table.

  • IdPipeline [Required]: Identifier of the pipeline.
  • IdEntity [Required]: Identifier of the related Entity.
  • IsMandatory [Required]: If true, the Assets associated with the IdEntity need to be available as a complete set for the pipeline to run. If false, the complete data calculation will not take this IdEntity into account. By default, TRUE.
  • PipelineExecutionProperties [Optional]: A JSON value that can be used in the pipeline for different purposes (see the sketch below). By default, NULL, empty or "{}".
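
As a purely illustrative sketch (the property name below is hypothetical and not defined by Sidra), the PipelineExecutionProperties column could store a small JSON object that the pipeline activities can later read:

{
    "hypotheticalProperty": "any value needed by the pipeline activities"
}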

Extraction pipelines execution

After understanding all the above concepts about Client Application pipeline behaviour, as well as the ones in the Concepts section, the execution can be summarized as follows:

The extraction pipelines will be launched by the Sync webjob once it has checked the following conditions:

  1. The PipelineSyncBehaviour is not Ignored.
  2. The pipeline is not associated with a Provider through an Entity that is marked as disabled.
  3. The pipeline is not being executed.
  4. There are Assets ready to be exported, i.e. the AssetStatus is MovedToDataLake or ImportingFromDataLake.
  5. If the relationship between the pipeline and the Entity has the flag IsMandatory = true, the dates of the Assets to be exported are between the StartValidDate and EndValidDate of the Entity.
  6. There are at least as many Assets ready to be exported as the number stored in the Entity field FilesPerDrop.

Available pipeline templates

There are several PipelineTemplates provided by Sidra which can be used to accelerate the data movement between the DSU and the Client Application. These pipelines are provided as-is and only for the purpose they are defined for. The platform is open to creating your own pipelines with your own activities, just as in Core, but here your pipeline must include the parameters explained in the previous points so that it can be launched by the Sync webjob.

The pipeline templates currently available are:

  • BF367329-ABE2-42CD-BF4C-8461596B961C: Executes an extraction from the DSU to the Client Application storage and staging tables, scaling the selected database up and down during the process.
  • 19C95A0E-3909-4299-AEE1-15604819E2B0: Executes an extraction from the DSU to the Client Application storage and staging tables.
  • F5170307-D08D-4F92-A9C9-92B30B9B3FF1: Extracts Assets to the storage account and executes a notebook.

SQL Staging Table generation

By default, for all the pipeline templates that drop the content into an SQL database, the destination staging table is auto-created by the data extraction pipeline, via the execution of a stored procedure provided by Sidra.

Sidra will create the table with the same Attributes defined in the Entity, in the same order and with the same types, except for SQL_VARIANT, VARBINARY, GEOGRAPHY and UNIQUEIDENTIFIER: the first three of these types will be converted to NVARCHAR(MAX), and UNIQUEIDENTIFIER to VARCHAR(16). This is to guarantee the compatibility between the parquet files and the Azure Data Factory Copy activity.

As the tables are recreated in every load, they don't include keys or indexes. These changes need to be explicitly included inside the orchestrator stored procedure called by the pipeline. This stored procedure also needs to ensure that the tables are optimized for the queries that will be executed.

Default pipeline template for extraction with database scale

There is a default pipeline template with the ItemId BF367329-ABE2-42CD-BF4C-8461596B961C that can be used to extract content. This pipeline will drop the content to the Client Application storage account and also to a table in the SQL database created in the Client Application.

In addition, the pipeline will scale the database up and down to provide the required capacity, in order to optimize the data movement and stored procedure executions.

The pipeline invokes the Sidra API to perform the extraction of the content to the Client Application storage. The content will be dropped in the storage account in .parquet format. The content is also copied to the staging tables via an Azure Data Factory Copy activity.

The template includes the parameters described before to work with the Sync webjob, and it defines some mandatory parameters that need to be populated in the ExecutionParameters:

  • storedProcedureName: The stored procedure that is going to be invoked after all the Entities have dropped the content.
  • scaleUpCapacity: The scale up capacity for the database.
  • scaleUpTierName: The scale-up tier name for the database. For scaleUpCapacity and scaleUpTierName, please check the Database Editions List.

For example, the following ExecutionParameters section could be used:

{
    "storedProcedureName": "[schema].[StoredProcedure]",
    "scaleUpCapacity":"10", 
    "scaleUpTierName":"S0" 
}

In addition, another stored procedure (the orchestrator stored procedure) is called within this pipeline, which inserts the tracked execution values in the ExtractPipelineExecution table.

Default pipeline template for extraction

There is a default pipeline template with the ItemId 19C95A0E-3909-4299-AEE1-15604819E2B0 that can be used to extract content.

This pipeline will drop the content to the Client Application storage account and also to a table in the SQL database created in the Client Application. The pipeline invokes the Sidra API to perform the extraction of the content to the Client Application storage. The content will be dropped in the storage account in .parquet format. The content is also copied to the staging tables via an Azure Data Factory Copy activity.

The template includes the parameters described before to work with the Sync webjob, and it defines some mandatory parameters that need to be populated in the ExecutionParameters:

  • storedProcedureName: The stored procedure that is going to be invoked after all the Entities have dropped the content.

For example, the following ExecutionParameters section could be used:

{
    "storedProcedureName": "[schema].[StoredProcedure]"
}

In addition, another stored procedure is called within this pipeline, which inserts the tracked execution values in the ExtractPipelineExecution table.

Pipeline template for a Notebook execution

This is the pipeline template with ItemId F5170307-D08D-4F92-A9C9-92B30B9B3FF1 that can be used to extract content and execute a Databricks Notebook once the content is stored in the storage account. It can be used with the DataLab Client Application template.

If the notebook execution succeeds, the status of the Assets involved will be updated to 2; otherwise, to 0.

The parameters required for using this pipeline are:

  • orchestratorNotebookPath: The path of the Notebook to execute. The Notebook should be previously created and uploaded into the Databricks instance.
  • entityName: A parameter to provide the name of the Entity (or empty).
  • pipelineName: A parameter to provide the name of the pipeline which is being executed (or empty).

Both entityName and pipelineName are parameters that are going to be provided to the Notebook; depending on the Notebook, they may or may not be used.

For example, the ExecutionParameters section would be:

{
    "orchestratorNotebookPath": "/Shared/MyNotebook",
    "entityName": "myentity",
    "pipelineName": "mypipeline"
}