Sidra API module: Data Querying

Query

The goal of this tutorial is to showcase how to use the Query API in Sidra. The Query API allows a user or system to access the data of a DSU, provided it has sufficient privileges.

The tutorial covers the three basic steps of using the Query API: authenticating, querying the DSU and polling for the result.

Throughout the tutorial we use Postman for simplicity’s sake, but feel free to use any other tool or programming language.

1. Sidra Core API authentication

To authenticate successfully against the Sidra API you will need:
  • The Access Token URL, which is built by appending the suffix "/connect/token" to the Identity Server URL. It will look like this: https://youidentityserver.azurewebsites.net/connect/token
  • A Client Id. It can be granted to a user or to a client application. In either case, the Client Id needs permissions on the DSU from which the data will be pulled.
  • A Client Secret. This is the password for the Client Id described above. It is used in the Client Credentials flow of Identity Server.
  • Scope: use "plainconcepts.sidra.api".
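For readers who prefer code over Postman, the token request can be sketched in Python with only the standard library. This is a minimal sketch, not the official client: it assumes Identity Server accepts the client id and secret as form fields in the request body (standard OAuth 2.0 client credentials) and that the JSON reply carries the token in the standard access_token field. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_token_request(identity_server_url, client_id, client_secret,
                        scope="plainconcepts.sidra.api"):
    """Build (but do not send) the client-credentials token request."""
    # Access Token URL = Identity Server URL + "/connect/token"
    token_url = identity_server_url.rstrip("/") + "/connect/token"
    # Assumption: credentials are sent as form fields in the POST body.
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,
    }).encode("ascii")
    return urllib.request.Request(
        token_url,
        data=form,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def fetch_access_token(request):
    """Send the request and return the access_token field of the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Calling fetch_access_token(build_token_request(url, client_id, client_secret)) would return the bearer token to use in the following steps.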

In Postman, create a new request, go to the Authorization tab, choose the "OAuth 2.0" type and then click "Get New Access Token". Then provide the details described above. It should look like this:

get-token-config

Once you have the token, you can proceed to the next step: querying the API.

2. Using the Query API

Once you are authenticated, it is time to set up the query that pulls data from the DSU. The endpoint works as follows:
1. The API processes your request and queues a Databricks job.
2. If everything is OK, the API responds with 202 Accepted. The response includes a header named "location" whose value is a URL containing the polling token (covered in the next step).
3. The API requires the user to provide an Azure Blob Storage SAS token for the storage where the data will be dumped.

The Query API supports two output formats: csv and parquet. This tutorial uses the parquet endpoint. Further details about the endpoints can be found on the API Swagger site (https://yousidracoreapi.azurewebsites.net/swagger/index.html).

As can be seen in Swagger, the parquet endpoint of the query service (/api/Query/entity/{idEntity}/parquet) accepts the following parameters (as of version 2020.R2):

query-api-parquet-params

An example of how to fill these parameters is:

query-api-parquet-params-filled

Some considerations about the request and response:
  • You must have access to the storage you provide. You need a SAS token for the storage where the data will be extracted (storageToken param).
  • The storageToken param must be URL encoded.
  • The IdSourceItems param is the comma-separated list of IdAsset values to be extracted.
  • If the polling token is generated, the response will be 202 Accepted.
  • The Client Id used to get the access token needs privileges on the DSU/provider/entity.
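The URL-encoding requirement is easy to get wrong by hand, so here is a minimal Python sketch that builds the request URL. It assumes both parameters travel in the query string and uses the names exactly as written above (storageToken, IdSourceItems). Any other parameters listed in Swagger are omitted, and the function name is hypothetical.

```python
import urllib.parse


def build_parquet_query_url(api_base_url, id_entity, storage_token, id_source_items):
    """Return the parquet query URL with a URL-encoded SAS token."""
    path = f"/api/Query/entity/{id_entity}/parquet"
    # urlencode percent-encodes every value, so the '&' and '=' inside the
    # SAS token cannot be mistaken for separators of the outer query string.
    query = urllib.parse.urlencode(
        {
            "storageToken": storage_token,
            "IdSourceItems": ",".join(str(i) for i in id_source_items),
        },
        quote_via=urllib.parse.quote,
    )
    return api_base_url.rstrip("/") + path + "?" + query
```

The request sent to this URL must also carry the access token from step 1, as Postman does once the token has been obtained.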

The polling token URL is returned in the Location header. Before moving to the next step, copy the URL and open a new tab in Postman.
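In a script, copying the URL by hand becomes reading the header from the response. The helper below is a small sketch with a hypothetical name. It takes the status code and a dictionary of response headers, however your HTTP client exposes them, and returns the polling URL.

```python
def extract_polling_url(status_code, headers):
    """Return the polling URL from a 202 Accepted response."""
    if status_code != 202:
        raise RuntimeError(f"Expected 202 Accepted, got {status_code}")
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "location":
            return value
    raise RuntimeError("202 response did not include a Location header")
```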

3. Polling the API

To provide an asynchronous way of knowing whether the data has already been extracted to the storage account, Sidra API provides a polling endpoint: the URL returned in the Location header of the previous response.

Querying this endpoint typically produces one of three outcomes:

  • 200 OK: The data has been extracted and can be found in the container and folder that the user provided.

polling-200-code

  • 202 Accepted: Processing is still in progress. Keep polling the URL until you get a 200 OK.

polling-202-code

  • 500 Error: Something went wrong. Common causes are invalid attributes, no permissions for the entity, and an invalid assetId.

polling-500-code
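The three outcomes map naturally onto a polling loop. The following is a minimal sketch under stated assumptions: the polling URL is queried with a plain GET carrying the access token as a Bearer header, and the interval and attempt limit are arbitrary defaults. Both function names are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_status(url, access_token):
    """Return the HTTP status code of a GET to url (assumes Bearer auth)."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; hand the code back to the caller instead.
        return error.code


def poll_until_done(get_status, polling_url, interval_seconds=10.0,
                    max_attempts=60, sleep=time.sleep):
    """Poll until 200 OK and return the number of polls that were made."""
    for attempt in range(1, max_attempts + 1):
        status = get_status(polling_url)
        if status == 200:
            return attempt  # data is in the storage account
        if status != 202:
            raise RuntimeError(f"Polling failed with HTTP {status}")
        sleep(interval_seconds)  # 202: still processing, wait and retry
    raise TimeoutError(f"No 200 OK after {max_attempts} attempts")
```

A typical call would be poll_until_done(lambda url: http_status(url, token), polling_url). Passing the status function as an argument keeps the loop independent of the HTTP client you choose.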

Once you get the 200 OK code, go to the storage account, where the file can be found. For instance:

file-in-blob-storage-query-api

In summary, the Query API service provides a way of getting data from the DSU asynchronously while enforcing security. This API is used internally in Sidra by all client applications in the extraction process.

Needless to say, this tutorial uses Postman, but feel free to use any tool or programming language. Also, Sidra version 2020.R2 comes with a Python library named pysidra, which provides a wrapper for the whole Core API as well as features related to ML model serving, among others. The querying and polling endpoints used in this tutorial are integrated in the library and available to anyone who uses it.