About the Sharepoint online library connector plugin¶
The Sidra connector plugin (a.k.a. connector) for Sharepoint online library enables a seamless connection to a collection of documents created as a Sharepoint online library.
This plugin is only available from Sidra Release 2022.R1.
Sidra's connector plugin for creating Data Intake Processes from Sharepoint online library is responsible for extracting documents from a Sharepoint library and integrating them further with Sidra.
You can find more details about what a Data Intake Process is on this page.
This plugin has been developed to cover two distinct scenarios differently:
When the destination container option is set to `Landing Zone for Azure Search`, the plugin will configure a full end-to-end binary file data ingestion. This is only applicable to binary file documents (e.g., PDFs). In this case, the Data Intake Process configured with this plugin will perform the following steps:
- Extract the data (documents) from the source (Sharepoint online library).
- Process these documents through the selected indexing pipelines (Azure Search) and store the generated insights data in the Knowledge Store.
- Ingest the projections (generated insights from the indexing process) for the ingested Assets (documents) into tables in Databricks (DSU ingestion).
Internally in Sidra, this means that two types of pipelines will be involved:
The ADF pipelines for extracting the raw data from Sharepoint at specified intervals to the indexing landing folder in Sidra.
The Azure Search pipelines for taking the data from the indexing landing folder, indexing the data via Azure Search, and finally ingesting the processed data in the DSU (Databricks).
When the destination container option is `other existing container`, the plugin will not configure a complete Data Intake Process, but rather just a data extraction from the source system (Sharepoint) to a `landing` container in Sidra.
- When running the plugin wizard, the user can specify the rules for extracting data and for assigning to Entities in Sidra.
- However, in this case a full end-to-end data intake is not configured. The user would need to manually create the remaining required metadata (Attributes and association to an ingestion pipeline) to configure a full end-to-end data ingestion.
Data intake CSV via Sharepoint connector
For more detailed information about data ingestion with the Sharepoint connector, you can check the tutorial.
Knowledge store ingestion
For more details on Knowledge Store ingestion, see the Sidra documentation.
Asset flow into platform via landing
For more details about data intake via landing, please check how the general file ingestion from the `landing` folder works: Asset flow via landing documentation.
How Assets are ingested
For more details on how the general file ingestion from the `landing` root folder works, see this documentation.
The Sidra Sharepoint plugin relies on the Sidra Metadata Model for mapping source data structures to Sidra Entities and destination paths. The underlying data integration mechanism within Sidra is Azure Data Factory.
When configuring and executing this plugin of type connector, several underlying steps are involved to achieve the following:
- The necessary metadata and data governance structures are created and populated in Sidra.
- The actual data integration infrastructure (ADF Pipeline) is created, configured, and deployed.
- Only in specific circumstances, when choosing `Landing Zone for Azure Search` as the destination path, the necessary configuration for associating the generated Assets with the corresponding Azure Search pipeline is created, as well as for running those Azure Search pipelines. The user can choose from a set of available Azure Search pipeline templates.
This plugin allows customizing the Entities to be created and to which the documents are associated.
This makes it possible to associate different file sets to different Entities in Sidra.
The user can specify each of these file sets and Entities by using two parameters: source paths and filter expressions.
See the Entities mapping section below for more detailed information about these settings and about how to associate sets of extracted documents to Entities in Sidra.
Destination paths and scope of ingestion automation¶
Sidra has different landing folders defined for different types of ingestion.
For example, the folder `indexlanding` is used for Knowledge Store ingestion, and the folder `landing` is used for general semi-structured file ingestion.
The current version of the Sharepoint plugin incorporates two possible inputs for specifying the destination path:
Destination path specification
- The plugin will automatically configure a whole end-to-end data extraction and file ingestion process into the data lake, once the right pipeline has been selected in the field "Please select an Azure Search pipeline for indexing".
- To do this, the underlying Entity and Attribute creation (metadata generation) will be handled transparently by the plugin.
- The Entities will be automatically created as per the fields configured in the metadata extraction step.
- The Attributes will be automatically created, as they can be defined in a standardized way for every Asset of type binary data (unstructured data).
The plugin will automatically associate the created Entities to a generic Azure Search pipeline. This option will therefore only apply to binary file ingestion.
For details on Knowledge Store ingestion, see the Sidra documentation.
For the data extraction process, the trigger configuration step will be used to determine the periodicity of extracting data from the source.
- Whenever files are deposited to the Landing Zone for Azure Search, an internal trigger in Sidra will transparently launch the binary file ingestion process.
- The user will be prompted to select from a list of available storage containers in Sidra, where the extracted documents from the source will be deposited.
- In this case, there won't be any file ingestion process, and the plugin will only cover the data extraction part.
- This applies to semi-structured files.
In order to configure a full end-to-end data ingestion process in the data lake, the user will need to manually execute the processes of metadata generation (creation of Attributes) and the association of the Entities to the data ingestion pipeline.
To make this possible, if the `other existing container` option is selected, the user will be asked to enter an additional parameter called `Format` for each configured Entity. This corresponds to the `Format` field in the Entity metadata model, which is needed for further data intake processing in Sidra.
For more details on how the general file ingestion from the `landing` root folder works, see this documentation.
Thanks to this plugin, the Data Intake Process is configured in less than five minutes. Once the settings are configured and the deployment process is started, the actual duration of the data ingestion may vary from a few minutes to a few hours, depending on the data volumes.
- After starting the Data Intake Process creation, users will receive a message that the process has started and will continue in the background. Users will be able to navigate through Sidra Web as usual while this process happens.
- Once the whole deployment process is finished, users will receive a notification in the Sidra Web Notifications widget. If the process finished successfully, the new data structures (new Entity) will appear in the Data Catalog automatically, and the Data Intake Process will incorporate this new data source.
In order to define which source files correspond to which Entity in Sidra, the wizard includes the following configuration fields per defined Entity:
- Entity Name: You can configure multiple Entities to be created. Each set of files will map to one Entity. The Entity Name is the name of the Entity in the Sidra metadata system to which the extracted set of files is assigned. Entity names cannot contain special characters. To configure the set of fields related to a new Entity, use the add option in the wizard.
- Source Path: The Source Path represents the relative path in the Sharepoint library from the root folder that defines the set of files mapped to an Entity. All files under this source path and following the Filter Expression field will be extracted and associated to the configured Entity. e.g. "Invoices/March", "Statistics/July".
- Filter Expression: A Filter Expression is used to filter a certain set of files to extract and to associate to the configured Entity. Filter expressions follow OData query syntax. Examples: a Filter Expression like `Name eq 'Invoice.xlsx'` will associate files with an exact name to the configured Entity, under the specified Source Path. A Filter Expression of `startswith(Name, 'Invoice')` will associate all files whose name starts with a set of characters under the specified Source Path to the configured Entity Name. Several filters can be combined with logical operators (`and`, `or`).
- Format: this is mapped to the `Format` Attribute for the Entity. This value is mandatory if the destination path for the files is `other existing container`, because in this scenario it is required to configure the created Entities.
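As an illustration, the mapping of extracted files to Entities via Source Path and Filter Expression can be sketched in Python. This is a minimal sketch supporting only the two filter forms shown above; the function names and the mapping structure are hypothetical and not part of the plugin itself.

```python
import re

def matches_filter(name, expr):
    """Evaluate a minimal subset of OData filter syntax against a file name.

    Supports only the two example forms from the text:
      Name eq 'Invoice.xlsx'
      startswith(Name, 'Invoice')
    Real OData filters are far richer; this is purely illustrative.
    """
    m = re.fullmatch(r"Name eq '(.*)'", expr.strip())
    if m:
        return name == m.group(1)
    m = re.fullmatch(r"startswith\(Name,\s*'(.*)'\)", expr.strip())
    if m:
        return name.startswith(m.group(1))
    raise ValueError("Unsupported filter expression: " + expr)

def assign_entity(source_path, name, mappings):
    """Return the Entity Name for a file, or None if no mapping applies."""
    for m in mappings:
        if source_path.startswith(m["source_path"]) and matches_filter(name, m["filter"]):
            return m["entity"]
    return None

# Hypothetical Entities mapping, mirroring the wizard fields above.
mappings = [
    {"entity": "MarchInvoices", "source_path": "Invoices/March",
     "filter": "startswith(Name, 'Invoice')"},
    {"entity": "JulyStats", "source_path": "Statistics/July",
     "filter": "Name eq 'Summary.xlsx'"},
]

assert assign_entity("Invoices/March", "Invoice_001.pdf", mappings) == "MarchInvoices"
assert assign_entity("Statistics/July", "Summary.xlsx", mappings) == "JulyStats"
assert assign_entity("Statistics/July", "Other.xlsx", mappings) is None
```

Files matching no Source Path / Filter Expression pair are simply not associated to any Entity, which is why each intended file set needs its own mapping entry in the wizard.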
When ingesting the data in Sidra, each document is registered as an Asset.
The process of setting up a Sharepoint Online Library Data Intake Process involves several key actions. The different sections of the input form are organized into these main steps:
Step 1. Configure Data Intake Process¶
Please see the common Sidra connector plugins page for the parameters needed to configure the fields for a Data Intake Process.
Step 2. Configure Provider¶
Please see the common Sidra connector plugins page for the parameters needed to create a new Provider.
Step 3. Configure Data Source¶
The data source represents the connection to the Sharepoint online library. A Data Source abstracts the details of creating a Linked Service in Azure Data Factory. The fields required in this section are the following:
- Choose an Integration Runtime to use.
- Site URL: this is the URL of the SharePoint Online site including the library name where the files are hosted. For example, https://contoso.sharepoint.com/sites/siteName/LibraryName.
- Tenant ID: The tenant ID under which your application resides. You can find it on the Azure portal Active Directory overview page.
- Service Principal ID: The application (client) ID of the application registered in Azure Active Directory.
- Service Principal key: The key (client secret) generated for the registered application.
The Sidra connector plugin for Sharepoint online library will register this new data source in the Sidra Metadata and deploy a Linked Service in Azure Data Factory with this connection. The default value for the Integration Runtime is AutoResolveIntegrationRuntime.
For more details on Linked Services, check the Data Factory documentation.
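For orientation, the Linked Service deployed from these fields could look roughly like the following. This sketch is modeled on Azure Data Factory's SharePoint Online List connector; the exact JSON that Sidra generates is not shown in this document, so the resource name and structure here are assumptions.

```python
import json

# Illustrative ADF Linked Service payload for a Sharepoint online connection.
# Placeholder values (<...>) stand for the wizard inputs described above.
linked_service = {
    "name": "SharepointOnlineLibraryLS",  # hypothetical name
    "properties": {
        "type": "SharePointOnlineList",
        "typeProperties": {
            "siteUrl": "https://contoso.sharepoint.com/sites/siteName",
            "tenantId": "<tenant-id>",
            "servicePrincipalId": "<application-client-id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<application-secret>",
            },
        },
        # Default Integration Runtime mentioned in the text.
        "connectVia": {
            "referenceName": "AutoResolveIntegrationRuntime",
            "type": "IntegrationRuntimeReference",
        },
    },
}

print(json.dumps(linked_service, indent=2))
```

Note that the secret is wrapped as a `SecureString`; in practice a Key Vault reference is the safer choice for storing the Service Principal key.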
Step 4. Configure Metadata Extractor¶
Sidra connector plugin for Sharepoint online library creates the needed metadata and orchestrates the data integration. To configure this, several fields are required:
- Destination Path: The storage destination path where the files from the Sharepoint library will be deposited. See sections Destination paths and scope of ingestion automation and Entities mapping for details about this field.
- Type of Azure Search pipeline to deploy: A selection field with a list of available Azure Search pipeline templates. A new pipeline of the selected type will be created to configure the full file indexing and ingestion.
- Entities mapping fields: Entity Name, Source Path, Filter Expression and Format. See section Entities mapping for details about these fields.
Step 5. Configure Trigger¶
Please see the common Sidra connector plugins page for the parameters needed to set up a trigger.
Supported data synchronization mechanisms¶
The Sidra connector plugin for Sharepoint library supports a full load synchronization mode and an incremental load synchronization mode. The incremental load mode uses the change tracking mechanism implemented by Sidra.
This incremental load is supported by the `EntityDeltaLoad` table in the Sidra Core metadata database. This table is described here.
The `LastDeltaValue` column in the `EntityDeltaLoad` table is used to track the last modified date of the documents in the Sharepoint online library.
The query to extract the documents checks if there is a `LastDeltaValue` in the `EntityDeltaLoad` table.
If the value is present, the synchronization mode is incremental, so the pipeline queries for all documents whose last modification date is greater than the value of `LastDeltaValue`. The query also filters for documents whose last modification date is less than or equal to the last `ExecutionDate` of the pipeline.
If the value is not present, the synchronization mode is full, which means that the query filters only by the last modification date of the documents.
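The incremental window described above can be sketched as a small filter builder. This is an illustration only: the `Modified` field name and the OData-style date syntax are assumptions, since the actual query the pipeline issues is not shown in this document.

```python
from datetime import datetime

def build_modified_filter(last_delta_value, execution_date):
    """Sketch of the delta-load window described in the text.

    last_delta_value: the LastDeltaValue tracked in EntityDeltaLoad,
                      or None when no previous load exists (full load).
    execution_date:   the pipeline's last ExecutionDate (upper bound).
    Returns an OData-style filter string (field name is hypothetical).
    """
    upper = "Modified le datetime'{}'".format(execution_date.isoformat())
    if last_delta_value is None:
        # Full load: no lower bound, everything up to the execution date.
        return upper
    # Incremental load: only documents modified after the tracked value.
    lower = "Modified gt datetime'{}'".format(last_delta_value.isoformat())
    return lower + " and " + upper

full = build_modified_filter(None, datetime(2022, 3, 1))
incr = build_modified_filter(datetime(2022, 2, 1), datetime(2022, 3, 1))

assert "gt" not in full                                   # full load has no lower bound
assert incr.startswith("Modified gt datetime'2022-02-01") # incremental has both bounds
```

After each successful run, `LastDeltaValue` would be advanced to the newest modification date seen, so the next execution picks up only documents changed since then.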