Choose the Content to Index
This section explains how to configure the connector so that it retrieves the content that you want to index, and nothing else.
Restrict the Content to Process
The content in SharePoint is organized in the following structure.
    Web Application (on-premise only)
    |- Site Collection
       |- Site
          |- Site
          |  |- ...
          |- Document Library
          |  |- File
          |  |  |- File version(s)
          |  |- Folder
          |     |- File
          |        |- File version(s)
          |- List
             |- List Item
                |- Attachment(s)
             |- Folder
                |- List Item
                   |- Attachment(s)
There can be multiple site collections, and multiple sites within a site or site collection. There can be multiple lists and document libraries within a site, multiple folders and files within a document library, and so on.
NOTE: Instances of SharePoint Online have a single site collection at the root level, and no concept of Web Applications.
You can restrict the content to retrieve by setting the following configuration parameters:
- SiteCollectionUrlCantHaveRegex and SiteCollectionUrlMustHaveRegex restrict the site collections to process, when you are retrieving information from a Web Application (SharePointUrlType=WebApplication).
- SiteUrlCantHaveRegex and SiteUrlMustHaveRegex restrict the sites to process. If a site is excluded, the connector does not index a document for the site and ignores all lists, list items, files, file versions, and attachments contained by the site.
  NOTE: If a site is excluded, its child sites are still processed, unless they are also excluded by one of the regular expressions.
- ListUrlCantHaveRegex and ListUrlMustHaveRegex restrict the lists and document libraries to process. If a list or document library is excluded, the connector does not index a document for the list or document library and ignores all list items, files, file versions, and attachments contained in the list or document library.
- ListItemUrlCantHaveRegex and ListItemUrlMustHaveRegex restrict the list items or files to process. If a list item or file is excluded, the connector does not index a document for the list item or file, and ignores all file versions and attachments for that list item or file.
- IndexAttachments specifies whether to index attachments for list items that are processed.
- IndexFileVersions and VersionIndexingMode specify which file versions to index for files that have versioning enabled.
- FileExtnCantHaveCSVs and FileExtnMustHaveCSVs restrict the files and attachments to process, based on their file extension.
- IndexSites, IndexLists, and IndexFolders specify whether to index a metadata-only document for each container object (sites, lists, and folders respectively).
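The MustHave/CantHave parameter pairs above can be read as an include/exclude filter on URLs. The following Python sketch illustrates that idea only; it is not the connector's implementation. It assumes that an unset pattern imposes no restriction, that a URL must match the MustHave pattern and must not match the CantHave pattern, and that patterns may match anywhere in the URL. The function name url_allowed and the sample URLs are hypothetical.

```python
import re
from typing import Optional


def url_allowed(url: str,
                must_have: Optional[str] = None,
                cant_have: Optional[str] = None) -> bool:
    """Return True if the URL passes a MustHave/CantHave regex pair.

    Assumed semantics (not confirmed by the connector documentation):
    an unset pattern imposes no restriction, and a pattern may match
    anywhere in the URL (re.search).
    """
    if must_have is not None and not re.search(must_have, url):
        return False
    if cant_have is not None and re.search(cant_have, url):
        return False
    return True


# Hypothetical list URLs, used only for illustration.
urls = [
    "http://sharepoint/site1/Documents",
    "http://sharepoint/site1/Archive",
    "http://sharepoint/site2/Documents",
]

# Keep lists under site1, except any list with "Archive" in its URL.
kept = [u for u in urls
        if url_allowed(u,
                       must_have=r"http://sharepoint/site1/",
                       cant_have=r"Archive")]
print(kept)  # ['http://sharepoint/site1/Documents']
```

Trying candidate patterns against a handful of real URLs in this way, before changing the connector configuration, can help catch a regular expression that is broader or narrower than intended.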
The connector performs best if you choose the objects to process at the highest possible level. For example, consider the following structure:
    http://sharepoint/                           Site Collection
    http://sharepoint/site1/                     Site
    http://sharepoint/site1/List1                List
    http://sharepoint/site1/List1/Item1          List Item
    http://sharepoint/site1/List1/Item2          List Item
    http://sharepoint/site1/List2                List
    http://sharepoint/site1/List2/Item1          List Item
    http://sharepoint/site1/List2/Item2          List Item
    http://sharepoint/site2/                     Site
    http://sharepoint/site2/List1                List
    http://sharepoint/site2/List1/Item1          List Item
    http://sharepoint/site2/SubSite/             Site
    http://sharepoint/site2/SubSite/List1        List
    http://sharepoint/site2/SubSite/List1/Item1  List Item
You could ignore all content from site1 by configuring ListUrlCantHaveRegex=http://sharepoint/site1/.*, but the connector would have to process site1, and all of the lists on that site, just to determine that the lists should be ignored. A more efficient configuration is SiteUrlCantHaveRegex=http://sharepoint/site1/, because the connector can immediately determine that nothing from that site has to be processed.
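The difference between the two configurations can be illustrated with a toy traversal of the example structure. This is a simplified model, not the connector's actual algorithm: it assumes that the connector examines each site, enumerates the lists of any site that is not excluded, and matches patterns anywhere in the URL. The function name count_visited is hypothetical.

```python
import re

# Toy model of the example structure: site URL -> list URLs on that site.
sites = {
    "http://sharepoint/site1/": [
        "http://sharepoint/site1/List1",
        "http://sharepoint/site1/List2",
    ],
    "http://sharepoint/site2/": ["http://sharepoint/site2/List1"],
    "http://sharepoint/site2/SubSite/": ["http://sharepoint/site2/SubSite/List1"],
}


def count_visited(site_cant_have=None, list_cant_have=None):
    """Count the sites and lists that are examined during a traversal."""
    visited = 0
    for site_url, list_urls in sites.items():
        visited += 1  # every site URL is examined once
        if site_cant_have and re.search(site_cant_have, site_url):
            continue  # excluded site: its lists are never enumerated
        for list_url in list_urls:
            visited += 1  # each list on a processed site is examined
            if list_cant_have and re.search(list_cant_have, list_url):
                continue  # excluded list: its items would be skipped
    return visited


# Excluding at list level still examines site1 and both of its lists.
print(count_visited(list_cant_have=r"http://sharepoint/site1/.*"))  # 7

# Excluding at site level skips the lists on site1 entirely.
print(count_visited(site_cant_have=r"http://sharepoint/site1/"))  # 5
```

In this model both configurations ignore the same content, but the site-level exclusion examines fewer objects. On a real site with many lists the saving would be correspondingly larger.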
Similarly, you could ignore content on site2, but still index content on site2/SubSite, by configuring ListUrlCantHaveRegex=http://sharepoint/site2/List.*. However, the connector would have to process site2 and all of the lists on that site, just to determine that the lists should be ignored. A more efficient configuration would contain SiteUrlCantHaveRegex=http://sharepoint/site2/$, so that the connector can immediately determine that nothing from site2 has to be processed. The URL for site2/SubSite does not match the regular expression http://sharepoint/site2/$, so content from site2/SubSite is still processed.
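The effect of the trailing $ can be checked with a short Python sketch. It uses Python's re module and assumes that a pattern may match anywhere in the URL; the connector's regular expression engine and matching rules may differ, so treat this as an illustration only.

```python
import re

# Site URLs from the example structure.
site_urls = [
    "http://sharepoint/site2/",
    "http://sharepoint/site2/SubSite/",
]

anchored = r"http://sharepoint/site2/$"   # pattern used in the text above
unanchored = r"http://sharepoint/site2/"  # same pattern without the $

# Collect the sites that each pattern would exclude.
excluded_anchored = [u for u in site_urls if re.search(anchored, u)]
excluded_unanchored = [u for u in site_urls if re.search(unanchored, u)]

print(excluded_anchored)
# ['http://sharepoint/site2/']
print(excluded_unanchored)
# ['http://sharepoint/site2/', 'http://sharepoint/site2/SubSite/']
```

Under these assumptions, the $ anchor is what stops the pattern from also matching the child site.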
Index Content that does not appear in Search Results
In SharePoint, a user can choose whether to allow items from a list or document library to appear in search results. Users can also choose whether to allow publishing pages (a type of list item) to appear in search engine results, using Search Engine Optimization (SEO) settings. In both cases, by default, the connector ignores items that do not appear. You can choose to modify this behavior:
- To index a list or document library regardless of whether it appears in search results, set the configuration parameter IgnoreNoCrawl to True.
- To index publishing pages regardless of SEO settings, set the configuration parameter IgnoreRobotsNoIndex to True.