Knowledge store ingestion notebook

The previous section about how binaries are ingested describes the general process for ingesting binary files using the Knowledge Store functionality in Sidra for knowledge mining. The main output of processing binary files in the Sidra Knowledge Store is a set of projections generated for each file. These projections are whatever index structures are defined by the specific Azure Search pipeline associated with the Entity configured for the ingested files. The association of files to an Entity in Sidra is covered in this tutorial. Projections are generated in an index (a JSON structure), which is the result of an indexer. You can find more details on the general Azure Search concepts page.

This page describes how the generated projections are ingested into Data Lake Gen2 storage (Databricks). The steps for ingesting the projections of a file into Databricks are performed by the execution of the AzureSearchKnowledgeStoreIngestion notebook.

This notebook performs several actions, including the following steps:

1. Retrieve secrets

In order to access the API and perform the required requests, the notebook retrieves several secrets that are configured in Databricks by the deployment project.
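As a minimal sketch, secrets in a Databricks notebook are typically read through the dbutils.secrets utility. The scope and key names below are assumptions for illustration only, not the actual names used by Sidra:

```python
# Minimal sketch: reading deployment-provided secrets in a Databricks notebook.
# The scope and key names ("sidra-scope", "api-client-id", ...) are hypothetical.
api_client_id = dbutils.secrets.get(scope="sidra-scope", key="api-client-id")
api_client_secret = dbutils.secrets.get(scope="sidra-scope", key="api-client-secret")
api_base_url = dbutils.secrets.get(scope="sidra-scope", key="api-base-url")
```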

2. Retrieve Assets

The next step is to retrieve the Assets currently indexed by Azure Search in order to locate their projections in the Knowledge Store. The API returns all Assets whose state is IndexingInAzureSearch.
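The exact Sidra API route is not documented on this page; the following sketch assumes a hypothetical REST endpoint filtered by Asset status, and an access token obtained with the secrets from step 1, just to illustrate the call pattern:

```python
import requests

# Hypothetical endpoint, parameter, and token; the real Sidra API route may differ.
response = requests.get(
    f"{api_base_url}/api/metadata/assets",
    params={"assetStatus": "IndexingInAzureSearch"},
    headers={"Authorization": f"Bearer {api_token}"},
)
response.raise_for_status()
assets = response.json()
```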

3. Create database and tables

As part of the ingestion, the database and tables in Databricks are created if the Entity is not yet deployed or if the field RecreateTableOnDeployment is set to true. For each Entity of the ingested Assets, the notebook checks whether the Databricks table already exists. If it does not, the table creation script, CreateTables, is executed. This script creates the necessary database and tables in the Databricks cluster to store the information produced by the indexer. More details can be found in the Table creation section.
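As a rough illustration of the kind of idempotent DDL such a creation script runs, assuming hypothetical database and table names (the real CreateTables script derives these from the Entity metadata):

```python
# Hypothetical names; the real script generates them from the Entity metadata.
spark.sql("CREATE DATABASE IF NOT EXISTS knowledge_store")
spark.sql("""
    CREATE TABLE IF NOT EXISTS knowledge_store.documents (
        Path STRING,
        Projections STRING,
        SidraIdAsset INT,
        SidraFileDate DATE,
        SidraPassedValidation BOOLEAN,
        SidraLoadDate DATE
    ) USING DELTA
""")
```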

4. Retrieve projections schema

The schema containing the projections is retrieved for the table. This schema is stored in the AdditionalProperties column of the Entity, in the field SearchServices. The important fields in this schema are the columns for the FilePath and for the Projections. With this schema, the Projections column is read. The table row definition includes the following columns (see the sketch after this list):

  • Path
  • Projections
  • SidraIdAsset = asset.id
  • SidraFileDate = asset.asset_date
  • SidraPassedValidation
  • SidraLoadDate = CurrentDate
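The rows for the Delta table can be assembled in PySpark roughly as follows. The column names follow the list above, while projections_df (the Path and Projections data read from the Knowledge Store) and the asset dictionary are assumptions for illustration:

```python
from pyspark.sql import functions as F

# projections_df is assumed to already hold the Path and Projections columns
# read from the Knowledge Store; asset is the Asset currently being processed.
rows_df = (
    projections_df
    .withColumn("SidraIdAsset", F.lit(asset["id"]))
    .withColumn("SidraFileDate", F.lit(asset["asset_date"]).cast("date"))
    .withColumn("SidraPassedValidation", F.lit(True))
    .withColumn("SidraLoadDate", F.current_date())
)
```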

5. Check the status of the indexer and merge into the Delta table in Databricks

The status of the indexer is checked to ensure it has finished processing. The Entity is known for each Asset, so the Entity is used to check the state of its associated indexer. Once the indexer has finished indexing the file, the projections are stored in a container of the Azure Storage Account. The execution of the notebook locates the information for the Asset, retrieves the projections, and stores them in the Databricks tables.
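A minimal sketch of checking an indexer's last run with the azure-search-documents Python SDK; the endpoint, admin key, and indexer name are assumptions (in Sidra they would come from the secrets retrieved in step 1 and the Entity configuration):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

# Hypothetical endpoint, key, and indexer name.
client = SearchIndexerClient(
    endpoint=search_endpoint,
    credential=AzureKeyCredential(search_admin_key),
)
status = client.get_indexer_status("entity-indexer")
last_status = status.last_result.status if status.last_result else None
print(last_status)  # e.g. "success", "inProgress", "transientFailure"
```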

The indexer must be in the SUCCESS state (not in an error state or still running) for the script to continue processing the Assets. If that is the case, a dataframe is created and the results are merged into the Delta table.
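The merge into the Delta table could look like the following sketch, using the Delta Lake Python API; the table name and merge key are assumptions:

```python
from delta.tables import DeltaTable

# Hypothetical table name and merge condition keyed on the file path.
target = DeltaTable.forName(spark, "knowledge_store.documents")
(
    target.alias("t")
    .merge(rows_df.alias("s"), "t.Path = s.Path")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```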

6. Change Asset status

The last step is to change the status of the Asset to MovedToDataStorageUnit. The number of entities and the size in bytes are also updated for the successful Assets.
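As before, the exact API route for this update is not shown on this page; a hypothetical sketch of the status change call might look like this:

```python
import requests

# Hypothetical endpoint and payload; the real Sidra API route may differ.
response = requests.put(
    f"{api_base_url}/api/metadata/assets/{asset['id']}/status",
    json={"status": "MovedToDataStorageUnit"},
    headers={"Authorization": f"Bearer {api_token}"},
)
response.raise_for_status()
```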


Last update: 2024-04-08