Concepts in Data Intake with document indexing

This page describes the main concepts involved in Data Intake with document indexing.

File registration in Data Intake with document indexing

File registration is the process of declaring an Asset in the platform; the Asset represents the file to be ingested into the Data Storage Unit (DSU).

This process is introduced here as a general concept and described in more detail in the page about ADF pipelines and how Assets are ingested.

Assets are registered using the Sidra API, and the process encompasses the following steps:

  1. Identify the Entity to which the file belongs.

    • Every Entity in the Core metadata DB has a RegularExpression column, a pattern for the blob names.
    • Assets whose blobs match the RegEx belong to the associated Entity (see the matching sketch after this list).
    • Additionally, the blob path must follow a naming convention that allows filtering by these patterns.
  2. The Entity record points to the DSU where the blobs will be stored: Azure Storage account and Databricks / Data Lake.
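
As an illustration of the matching in step 1, here is a minimal Python sketch of how a blob name could be resolved to its Entity through the RegularExpression column. The Entity records below are hypothetical; Sidra performs this matching internally.

```python
import re

# Hypothetical excerpt of the Entity metadata: each Entity carries a
# RegularExpression pattern that blob names are matched against.
entities = [
    {"name": "documents", "regular_expression": r"^sharepoint/documents/.*\.pdf$"},
    {"name": "invoices", "regular_expression": r"^erp/invoices/.*\.xml$"},
]

def resolve_entity(blob_name):
    """Return the first Entity whose RegularExpression matches the blob name."""
    for entity in entities:
        if re.match(entity["regular_expression"], blob_name):
            return entity["name"]
    return None  # no Entity claims this blob

print(resolve_entity("sharepoint/documents/Meeting-notes.pdf"))  # -> documents
```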

The path convention for the files in the indexing landing zone can be defined in two possible structures (see the parsing sketch after this list):

  • {provider-name}/{entity-name}/{file-with-date}
  • {provider-name}/{entity-name}/{year}/{month}/{day}/{file}
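
The sketch below shows how such paths could be split into their parts; the function and the returned field names are illustrative, not Sidra's actual implementation.

```python
def parse_landing_path(path):
    """Split a landing-zone blob path according to the two conventions above."""
    parts = path.strip("/").split("/")
    if len(parts) == 3:  # {provider-name}/{entity-name}/{file-with-date}
        provider, entity, file_name = parts
        return {"provider": provider, "entity": entity, "file": file_name}
    if len(parts) == 6:  # {provider-name}/{entity-name}/{year}/{month}/{day}/{file}
        provider, entity, year, month, day, file_name = parts
        return {"provider": provider, "entity": entity,
                "date": f"{year}-{month}-{day}", "file": file_name}
    raise ValueError(f"Path does not follow a known convention: {path}")

print(parse_landing_path("sharepoint/documents/2024/02/23/Meeting-notes.pdf"))
```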

Considering an example where the Provider is called SharePoint and the Entity is Documents, we would have:

  • Landing, where the file is dropped: DsuStageAccount: indexlanding/sharepoint/documents/Meeting-notes.pdf

  • Final raw copy: DsuStageAccount: sharepoint/documents/Meeting-notes_id12345.pdf

  • KnowledgeStore, search projections: DsuStageAccount: knowledgestore/12345/FullData.json

Entity-to-Indexer

Once the Entity to which the Asset belongs is known, the indexer pipeline is selected based on the relation between Pipelines and Entities. This relation is stored in the [EntityPipeline] table of Sidra's Core metadata DB, which is used to retrieve the indexer pipeline related to the Entity.
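
Conceptually, this lookup is a join between the [EntityPipeline] association table and the pipelines table. The sketch below illustrates it with pyodbc; the connection string, the [Pipeline] table name, and the column names are assumptions, since only [EntityPipeline] is named above.

```python
import pyodbc

connection = pyodbc.connect("DSN=SidraCoreDb")  # hypothetical DSN

# Assumed join shape: [EntityPipeline] associates Entities with Pipelines.
QUERY = """
SELECT p.*
FROM [EntityPipeline] ep
JOIN [Pipeline] p ON p.Id = ep.IdPipeline
WHERE ep.IdEntity = ?
"""

def indexer_pipeline_for_entity(entity_id):
    """Retrieve the indexer pipeline related to the given Entity."""
    cursor = connection.cursor()
    cursor.execute(QUERY, entity_id)
    return cursor.fetchone()
```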

The pipelines used for binary file indexing are created from pipeline templates of type AzureSearchIndexer. The pipeline template defines the skillset, indexer, and index fields. The skills can be out-of-the-box Azure Search capabilities, but also any kind of custom skill: an API, a model inference endpoint, etc.

The last step of the background task that registers the files is to execute the indexer, which crawls all the files associated with the same Entity. If the indexer is not deployed yet, it is deployed in this step before being executed.
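
Running an indexer on demand is a standard Azure AI Search REST operation ("Run Indexer"). The sketch below shows the kind of call the background task would ultimately issue; the service name, indexer name, and API key are placeholders.

```python
import requests

service = "my-search-service"             # placeholder search service
indexer = "sharepoint-documents-indexer"  # placeholder indexer name
api_key = "<admin-api-key>"               # placeholder admin key

# POST .../indexers/{name}/run queues a crawl of the Entity's files;
# Azure Search answers HTTP 202 Accepted when the run is queued.
response = requests.post(
    f"https://{service}.search.windows.net/indexers/{indexer}/run",
    params={"api-version": "2023-11-01"},
    headers={"api-key": api_key},
)
response.raise_for_status()
```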

Azure Search artifacts

The following artifacts are created when an indexer pipeline is deployed from its template definition:

  • Index: A structure, similar to a database table, that holds the processed data (document entries) and can accept search queries.
  • Data Source: The object that tells Azure Search the location and nature of the data; by analogy, it works like a connection string.
  • Indexer: The object that crawls the data from the Data Source and, on a predefined schedule, pushes the indexed and analyzed data to the Index.
  • Skillset: Additional functionality that helps the Indexer enrich the search capabilities, for example by adding Projections.

Once a pipeline of the AzureSearchIndexer type is deployed, the created artifacts can be seen in the Azure Search resource of the DSU.
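
These artifacts map onto objects of the azure-search-documents Python SDK. The following minimal sketch creates a Data Source and an Indexer; every name, the connection string, and the container are placeholders, and the Index and Skillset are assumed to exist already.

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

endpoint = "https://my-search-service.search.windows.net"  # placeholder
client = SearchIndexerClient(endpoint, AzureKeyCredential("<admin-api-key>"))

# Data Source: where and what the data is (akin to a connection string).
client.create_data_source_connection(SearchIndexerDataSourceConnection(
    name="sharepoint-documents-ds",
    type="azureblob",
    connection_string="<storage-connection-string>",
    container=SearchIndexerDataContainer(name="sharepoint"),
))

# Indexer: crawls the Data Source and pushes documents into the target Index,
# running the Skillset along the way (both assumed to be created already).
client.create_indexer(SearchIndexer(
    name="sharepoint-documents-indexer",
    data_source_name="sharepoint-documents-ds",
    target_index_name="sharepoint-documents-index",
    skillset_name="sharepoint-documents-skillset",
))
```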

Read more about these artifacts in Microsoft's documentation on Azure Search concepts.


Last update: 2024-02-23