How binaries are ingested into Databricks¶
Sidra can ingest binary files - PDF or Word documents, media files, etc. - as part of the ingestion process; these files are also indexed to improve search capabilities.
The ingestion of binaries via the Knowledge Store is an automatic process. The result of the indexing - the projections - is stored in Databricks. Projections are the pieces of system-generated insight produced by a knowledge mining activity in Azure Search, e.g. automated tagging, index dimensions, automated labelling, processed text, etc.
To automate the process in Sidra, the system runs a background task that ingests any file landed in a specific container of the DSU Stage Azure Storage. When files are placed in the monitored container, the background job triggers the pipeline configured for the Entity to which the Assets belong. One result of this pipeline is that the file indexing is "enriched" with projections.
Besides the specifics of content indexing, the platform treats a binary file as an Asset, just like any other data from other sources.
Steps for binary Asset ingestion¶
This section describes how a binary Asset (file) flows through the platform until it ends up in the Data Storage Unit (DSU) and Databricks.
Step 1. File landing¶
The binary document is placed in the monitored "landing" container; specifically, the `indexlanding` container of the Azure Storage Account in the DSU containing "stage" in its name. The document file becomes an Azure Storage Blob.
For example, a Data Factory pipeline could be set to periodically copy document files from a SharePoint site to the `indexlanding` container, in a folder named `<provider>/<entity>`.
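As an illustration, the copy into the landing container can be as simple as a blob upload that respects the `<provider>/<entity>` folder convention. The following is a minimal sketch using the Azure Storage Python SDK; the connection string, provider, entity, and file names are placeholders, not values defined by Sidra:

```python
# Sketch: drop a document into the monitored landing container.
# Connection string, provider/entity names and file name are illustrative assumptions.
from azure.storage.blob import BlobServiceClient

STAGE_CONNECTION_STRING = "<DSU stage storage connection string>"  # placeholder
LANDING_CONTAINER = "indexlanding"
PROVIDER, ENTITY = "humanresources", "contracts"                   # hypothetical names

service = BlobServiceClient.from_connection_string(STAGE_CONNECTION_STRING)
container = service.get_container_client(LANDING_CONTAINER)

local_file = "contract-2023-001.pdf"                                # hypothetical file
blob_path = f"{PROVIDER}/{ENTITY}/{local_file}"                     # <provider>/<entity>/<file>

with open(local_file, "rb") as data:
    container.upload_blob(name=blob_path, data=data, overwrite=True)
```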
Step 2. Asset registration¶
The background task, a Hangfire job, listens to the `indexlanding` container and retrieves a batch of blobs once they are copied there.
The blob name must follow a path convention matching the regular expression of the Entity, so that the Entity associated with the Asset representing the file can be easily identified. Once the Entity is determined, the configured associated pipeline can be triggered. This is not a Data Factory pipeline - it should be of type `AzureSearchIndex` - and it is based on the Pipeline Template `AzureSearchIndexerSemanticSearch`.
In this step, Sidra's task will:
- Register the file blob as an Asset:
    - An entry is added in the Sidra Service DB `[DataIngestion].[Asset]` table; initially, with `[IdStatus]=8`.
    - This is done by calling the Sidra API for file registration.
- Move the Asset blob from the `indexlanding` container to its final destination:
    - Sidra DSU Stage Azure Storage, in the container having the same name as the Entity Provider.
    - This is called the raw copy of the file.
- Invoke the Azure Search Indexer to start processing the batch of blobs.
The background task, the Hangfire job, runs every 5 minutes.
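Conceptually, the Entity resolution works by testing the blob path against each Entity's regular expression. The following sketch is purely illustrative; the regex values and the registration step shown are hypothetical, not the actual Sidra job implementation:

```python
import re

# Hypothetical Entity metadata: Entity name -> path regular expression.
ENTITY_REGEX = {
    "contracts": re.compile(r"^humanresources/contracts/(?P<file>.+\.pdf)$"),
    "invoices": re.compile(r"^finance/invoices/(?P<file>.+\.(pdf|docx))$"),
}

def resolve_entity(blob_path: str) -> str | None:
    """Return the Entity whose regular expression matches the blob path."""
    for entity, pattern in ENTITY_REGEX.items():
        if pattern.match(blob_path):
            return entity
    return None

entity = resolve_entity("humanresources/contracts/contract-2023-001.pdf")
if entity is not None:
    # Register the Asset through the Sidra API (call not shown) and trigger the
    # AzureSearchIndex pipeline configured for this Entity; the new Asset record
    # starts with [IdStatus]=8.
    print(f"Blob belongs to Entity '{entity}'")
```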
Step 3. Asset indexing¶
Azure Search processes the Asset file with its indexers:
- One indexes the SQL record of the Asset.
- One indexes the content of the file for Cognitive Search. The Skillset of this indexer increases the search capabilities.
The Azure Search Skills are called to process the file, extract knowledge, and index it. The resulting index projections will be stored in Databricks.
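For reference, an indexer execution can be started and monitored with the Azure Cognitive Search Python SDK. This is only a sketch of the mechanism, with placeholder endpoint, key, and indexer name; the real values come from the DSU configuration:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

# Placeholder endpoint, key and indexer name.
client = SearchIndexerClient(
    endpoint="https://<search-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

indexer_name = "binary-content-indexer"  # hypothetical indexer name

client.run_indexer(indexer_name)                  # start processing the batch of blobs
status = client.get_indexer_status(indexer_name)  # poll the execution status
if status.last_result is not None:
    print(status.last_result.status)              # e.g. "success" once the run finishes
```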
Step 4. Knowledgestore¶
The final step is to store in the DSU the projections obtained as a result of the indexer execution. The generated projections for an Asset are retrieved from the indexer and stored in a designated container of the DSU Stage Azure Storage Account; specifically, the `knowledgestore` container. Each indexed Asset will have its own folder in that container; the JSON blob there holds the projections.
The ingestion is performed by the execution of a notebook in the Databricks cluster; see the next and last step.
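As a small illustration, the projection blobs written for one Asset can be inspected with the Azure Storage SDK. The connection string is a placeholder, and using the Asset ID as the folder name is an assumption for the example:

```python
from azure.storage.blob import ContainerClient

# Placeholder connection string; "knowledgestore" is the container described above.
container = ContainerClient.from_connection_string(
    conn_str="<DSU stage storage connection string>",
    container_name="knowledgestore",
)

# Each indexed Asset has its own folder; the folder name used here is hypothetical.
asset_folder = "12345"
for blob in container.list_blobs(name_starts_with=f"{asset_folder}/"):
    print(blob.name)  # JSON blob(s) holding the projections
```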
Step 5. Databricks¶
This is the final step of the binary file ingestion: by running a Databricks notebook called `AzureSearchKnowledgeStoreIngestion`, the Asset entry - blob location and projections - ends up as an entry in the data lake table corresponding to the Provider's Entity.
The instructions in this notebook actually move the data to Databricks once the indexer status is succeeded. Find details about this script in Knowledge Store Ingestion.
The Asset record in Sidra's Core DB is set to `[IdStatus]=2` (fully ingested) after this final step.
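The notebook logic can be pictured with the following PySpark sketch, which reads the projection JSON blobs from the `knowledgestore` container and appends them to the Entity's table. Paths and table names are placeholders; the actual `AzureSearchKnowledgeStoreIngestion` notebook is provided by Sidra and may differ:

```python
# Minimal, illustrative sketch of the knowledge-store ingestion step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder locations: the projections JSON written for an Asset, and the
# data lake table corresponding to the Provider's Entity.
projections_path = "abfss://knowledgestore@<stageaccount>.dfs.core.windows.net/<asset-folder>/"
target_table = "humanresources.contracts"  # hypothetical Provider/Entity table

projections = spark.read.json(projections_path)

# Append the projections (enriched content, labels, etc.) to the Entity table;
# in Sidra this happens only once the indexer run has succeeded.
projections.write.format("delta").mode("append").saveAsTable(target_table)
```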
Changes in the system after the Asset ingestion¶
After the last step for file registration is executed, the platform will have been updated as follows:
- A new Asset is added to the platform and a record about it is included in the Sidra Service metadata DB.
- A raw copy of the Asset is stored in the Provider's container, in the DSU's "Stage" Azure Storage.
- The file is indexed in Azure Search and its entry may be found by querying the index by Asset ID, e.g. `$filter=Id eq '12345'` (see the query sketch after this list).
- The projections about the file are included in the DSU's:
    - "Stage" Azure Storage, `knowledgestore` container.
    - Databricks, Provider's Entity table.