About the Sharepoint online library connector plugin

≥ 2022.R1

The Sidra connector plugin (a.k.a. connector) for Sharepoint online library enables seamless connection to a Sharepoint online collection of documents created as a Sharepoint library.

Sidra's connector plugin for creating Data Intake Processes from a Sharepoint online library is responsible for extracting documents from the Sharepoint library and integrating them into Sidra.

You can find more details about what a Data Intake Process is on this page.

This plugin behaves differently in two distinct scenarios, depending on the destination container option:

When the destination container option is set to Landing Zone for Azure Search, the plugin configures a full end-to-end ingestion of binary files. This scenario applies only to binary file documents (e.g., PDFs). In this case, the Data Intake Process configured with this plugin performs the following steps:
1. Extract the data (documents) from the source (Sharepoint online library).
2. Process these documents through the selected indexing pipelines (Azure Search) and store the generated insights in the Knowledge Store. The available Azure Search pipelines are:
  - AzureSearchBasicIndexer. This is a template for an Azure Search Basic Indexer, including Index and Skillset. It can be related with a Dataset template for the DataSource.
  - AzureSearchIndexerNER. This is a template for an Azure Search Named Entity Recognition Indexer, including Index and Skillset. It can be related with a Dataset template for the DataSource.
  - AzureSearchIndexerSemanticSearch. This is a template for an Azure Search Indexer that performs semantic search.
3. Ingest the projections (generated insights from the indexing process) for the ingested Assets (documents) into tables in Databricks (DSU ingestion).
Internally in Sidra, this means that two types of pipelines will be involved:
- The ADF pipelines for extracting the raw data from Sharepoint at specified intervals to the indexing landing folder in Sidra.
- The Azure Search pipelines for taking the data from the indexing landing folder, indexing the data via Azure Search, and finally ingesting the processed data in the DSU (Databricks).
When the destination container option is set to another existing container, the plugin does not configure a complete Data Intake Process, but only a data extraction from the source system (Sharepoint) to a landing container in Sidra.
1. When running the plugin wizard, the user can specify the rules for extracting data and for assigning to Entities in Sidra.
2. However, in this case a full end-to-end data intake is not configured. The user needs to manually create the remaining metadata (Attributes and the association to an ingestion pipeline) to complete a full end-to-end data ingestion.
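The difference between the two scenarios above can be summarised in a small Python sketch. It is only an illustration of the behaviour described in this section: the function name `planned_steps` and the step descriptions are hypothetical and are not part of any Sidra API.

```python
from typing import List

# Label of the destination option as it is named in this page.
LANDING_ZONE_FOR_AZURE_SEARCH = "Landing Zone for Azure Search"


def planned_steps(destination_container: str) -> List[str]:
    """Return the steps covered by the plugin for a given destination option.

    Hypothetical helper for illustration only; it mirrors the two scenarios
    described in the text and does not call Sidra.
    """
    if destination_container == LANDING_ZONE_FOR_AZURE_SEARCH:
        # Full end-to-end ingestion of binary documents.
        return [
            "extract documents from the Sharepoint online library",
            "index documents with the selected Azure Search pipeline",
            "ingest projections into Databricks tables (DSU ingestion)",
        ]
    # Any other existing container: extraction only, the rest is manual.
    return [
        "extract documents from the Sharepoint online library to the landing container",
        "manually create remaining metadata (Attributes, pipeline association)",
    ]


if __name__ == "__main__":
    for step in planned_steps(LANDING_ZONE_FOR_AZURE_SEARCH):
        print(step)
```

In this sketch, any destination other than Landing Zone for Azure Search yields only the extraction step plus the manual follow-up work.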

Step-by-step

Data intake CSV via Sharepoint connector

For more detailed information about data ingestion with the Sharepoint connector, you can check the tutorial.

Knowledge store ingestion

For more details on Knowledge Store ingestion, see the Sidra documentation.

Asset flow into platform via landing

For more details about data intake via landing, see how general file ingestion from the `landing` folder works in the Asset flow via landing documentation.

How Assets are ingested and pipelines needed

For more details on how general file ingestion from the `landing` root folder works, see this documentation.

The Sidra Sharepoint plugin relies on the Sidra Metadata Model for mapping source data structures to Sidra Entities and destination paths. The underlying data integration mechanism within Sidra is Azure Data Factory.

When this connector plugin is configured and executed, several underlying steps are performed to achieve the following:

The necessary metadata and data governance structures are created and populated in Sidra.
The actual data integration infrastructure (ADF Pipeline) is created, configured, and deployed.
Only when Landing Zone for Azure Search is chosen as the destination path, the configuration needed to associate the generated Assets with the corresponding Azure Search pipeline, and to run that pipeline, is also created. The user can choose from a set of available pipelines of type Azure Search.

This plugin allows customizing which Entities are created and which documents are associated with each of them, so that different file sets can be mapped to different Entities in Sidra. The user defines each file set and its Entity with two parameters: a source path and a filter expression. The Entities mapping section gives more detail about these settings and about how to associate sets of extracted documents with Entities in Sidra.

Versions

From Sidra version 2022.R3 onwards, the autogenerated Transfer Query is replaced by the DSU ingestion script. More information is available here.

Entities mapping

To define which source files correspond to which Entity in Sidra, the wizard includes the following configuration fields for each defined Entity:

Entity Name: You can configure multiple Entities to be created. Each set of files maps to one Entity. The Entity Name is the name of the Entity in the Sidra metadata system to which the extracted set of files is assigned. Entity names cannot contain special characters. To configure the set of fields for a new Entity, click the Add Entity button.
Source Path: The Source Path is the path in the Sharepoint library, relative to the root folder, that defines the set of files mapped to an Entity. All files under this source path that match the Filter Expression field are extracted and associated with the configured Entity. Examples: "Invoices/March", "Statistics/July".
Filter Expression: A Filter Expression selects the set of files to extract and associate with the configured Entity. Filter expressions follow OData query syntax. Examples: a Filter Expression like "Name eq 'Invoice.xlsx'" associates the file with that exact name, under the specified Source Path, with the configured Entity. A Filter Expression like "startswith(Name, 'Invoice')" associates all files under the specified Source Path whose name starts with "Invoice" with the configured Entity. Several filters can be combined with logical operators (and, or).
Format: This is mapped to the Format Attribute of the Entity. This value is mandatory when the destination path for the files is another existing container, because in that scenario it is required to configure the created Entities.
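As an illustration of the fields above, the following Python sketch builds the example filter expressions as strings and checks the rules stated in this section. All names here (`EntityMapping`, `name_eq`, `name_startswith`, `combine`) are hypothetical helpers and not part of Sidra. The pattern used for "no special characters" (letters, digits and underscores only) is an assumption, because this page does not list the allowed characters. Doubling embedded single quotes follows the usual OData convention for string literals.

```python
import re
from dataclasses import dataclass
from typing import List


def odata_literal(value: str) -> str:
    """Quote a string as an OData literal; embedded single quotes are doubled."""
    return "'" + value.replace("'", "''") + "'"


def name_eq(file_name: str) -> str:
    """Filter matching a file by its exact name."""
    return f"Name eq {odata_literal(file_name)}"


def name_startswith(prefix: str) -> str:
    """Filter matching files whose name starts with the given prefix."""
    return f"startswith(Name, {odata_literal(prefix)})"


def combine(operator: str, *filters: str) -> str:
    """Join several filters with 'and' or 'or', parenthesising each one."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return f" {operator} ".join(f"({f})" for f in filters)


@dataclass
class EntityMapping:
    """Hypothetical in-memory model of one Entity block in the wizard."""
    entity_name: str
    source_path: str
    filter_expression: str
    format: str = ""

    def validate(self, other_existing_container: bool) -> List[str]:
        problems = []
        # Assumption: only letters, digits and underscores are allowed.
        if not re.fullmatch(r"[A-Za-z0-9_]+", self.entity_name):
            problems.append("Entity Name contains special characters")
        # Format is mandatory when the destination is another existing container.
        if other_existing_container and not self.format:
            problems.append("Format is mandatory for this destination")
        return problems


if __name__ == "__main__":
    mapping = EntityMapping(
        entity_name="MarchInvoices",  # hypothetical Entity name
        source_path="Invoices/March",
        filter_expression=name_startswith("Invoice"),
    )
    print(mapping.filter_expression)
    print(mapping.validate(other_existing_container=True))
```

Building the expressions through small helpers keeps quoting consistent when several conditions are combined with `and` or `or`.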

When ingesting the data in Sidra, each document is registered as an Asset.
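The mapping from documents to Entities and Assets can be pictured with a small local simulation. This is a sketch under stated assumptions, not how Sidra implements it: the names `Mapping` and `assign_assets` and the Entity names are hypothetical, the Filter Expression is replaced by a plain Python predicate on the file name, and files in subfolders of the Source Path are assumed to be included.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Mapping:
    entity_name: str
    source_path: str                     # relative to the library root
    matches_name: Callable[[str], bool]  # local stand-in for the Filter Expression


def assign_assets(file_paths: List[str], mappings: List[Mapping]) -> Dict[str, List[str]]:
    """Group document paths by Entity; each grouped path stands for one Asset."""
    assets: Dict[str, List[str]] = {m.entity_name: [] for m in mappings}
    for path in file_paths:
        folder, _, name = path.rpartition("/")
        for m in mappings:
            prefix = m.source_path.strip("/")
            # Assumption: subfolders of the Source Path are included.
            under_path = folder == prefix or folder.startswith(prefix + "/")
            if under_path and m.matches_name(name):
                assets[m.entity_name].append(path)
                break  # first matching mapping wins in this sketch
    return assets


mappings = [
    Mapping("MarchInvoices", "Invoices/March", lambda n: n.startswith("Invoice")),
    Mapping("JulyStatistics", "Statistics/July", lambda n: n == "Summary.pdf"),
]
files = [
    "Invoices/March/Invoice001.pdf",
    "Invoices/March/Notes.pdf",
    "Statistics/July/Summary.pdf",
]
result = assign_assets(files, mappings)
```

Here `Invoices/March/Notes.pdf` is not assigned to any Entity, because it sits under a configured Source Path but does not satisfy that Entity's filter.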