Choose the Content to Index

This section explains how to configure the connector so that it retrieves the content that you want to index, and nothing else.

Restrict the Content to Process

The content in SharePoint is organized in the following structure.

Web Application (on-premises only)
  |- Site Collection
       |- Site
            |- Site
            |    |- ...
            |- Document Library
            |    |- File
            |    |    |- File version(s)
            |    |- Folder
            |         |- File
            |              |- File version(s)
            |- List 
                 |- List Item
                      |- Attachment(s)
                 |- Folder
                      |- List Item
                           |- Attachment(s)

There can be multiple site collections, and multiple sites within a site or site collection. There can be multiple lists and document libraries within a site, multiple folders and files within a document library, and so on.

NOTE: Instances of SharePoint Online have a single site collection at the root level, and no concept of Web Applications.

You can restrict the content to retrieve by setting the following configuration parameters:

  • SiteCollectionUrlCantHaveRegex and SiteCollectionUrlMustHaveRegex restrict the site collections to process, when you are retrieving information from a Web Application (SharePointUrlType=WebApplication).
  • SiteUrlCantHaveRegex and SiteUrlMustHaveRegex restrict the sites to process. If a site is excluded, the connector does not index a document for the site and ignores all lists, list items, files, file versions, and attachments contained in the site.

    NOTE: If a site is excluded, its child sites are still processed, unless they are also excluded by one of the regular expressions.

  • ListUrlCantHaveRegex and ListUrlMustHaveRegex restrict the lists and document libraries to process. If a list or document library is excluded, the connector does not index a document for the list or document library and ignores all list items, files, file versions, and attachments contained in the list or document library.

  • ListItemUrlCantHaveRegex and ListItemUrlMustHaveRegex restrict the list items or files to process. If a list item or file is excluded, the connector does not index a document for the list item or file, and ignores all file versions and attachments for that list item or file.

  • IndexAttachments specifies whether to index attachments for list items that are processed.

  • IndexFileVersions and VersionIndexingMode specify which file versions to index for files that have versioning enabled.
  • FileExtnCantHaveCSVs and FileExtnMustHaveCSVs restrict the files and attachments to process, based on their file extension.
  • IndexSites, IndexLists, and IndexFolders specify whether to index a metadata-only document for each container object (sites, lists, and folders respectively).
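
The way a MustHave/CantHave pair of parameters restricts a set of URLs can be sketched in Python. This is an illustration only, not the connector's implementation. It assumes that an object is processed only if its URL matches the MustHave expression (when one is set) and does not match the CantHave expression (when one is set), and that expressions can match anywhere in the URL. The function name is_processed is hypothetical, and the connector's regular expression engine might behave differently from Python's re module.

```python
import re
from typing import Optional


def is_processed(url: str,
                 must_have: Optional[str] = None,
                 cant_have: Optional[str] = None) -> bool:
    """Return True if a URL passes a MustHave/CantHave pair of filters.

    Assumptions (not confirmed by the connector documentation):
    an unset filter places no restriction, and a pattern may match
    anywhere in the URL (re.search) rather than the whole URL.
    """
    if must_have is not None and re.search(must_have, url) is None:
        return False  # a MustHave expression is set and the URL does not match it
    if cant_have is not None and re.search(cant_have, url) is not None:
        return False  # a CantHave expression is set and the URL matches it
    return True


# Illustrative URL; a list that is excluded by a CantHave expression.
print(is_processed("http://sharepoint/site1/List1",
                   cant_have=r"http://sharepoint/site1/"))  # False
```

Under these assumptions, a URL must pass both filters of a pair to be processed, and leaving both unset processes everything at that level.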

 

The connector performs best if you choose the objects to process at the highest possible level. For example, consider the following structure:

   http://sharepoint/                           Site Collection   
   http://sharepoint/site1/                     Site
   http://sharepoint/site1/List1                List
   http://sharepoint/site1/List1/Item1          List Item
   http://sharepoint/site1/List1/Item2          List Item
   http://sharepoint/site1/List2                List
   http://sharepoint/site1/List2/Item1          List Item
   http://sharepoint/site1/List2/Item2          List Item
   http://sharepoint/site2/                     Site
   http://sharepoint/site2/List1                List
   http://sharepoint/site2/List1/Item1          List Item
   http://sharepoint/site2/SubSite/             Site
   http://sharepoint/site2/SubSite/List1        List
   http://sharepoint/site2/SubSite/List1/Item1  List Item

You could ignore all content from site1 by configuring ListUrlCantHaveRegex=http://sharepoint/site1/.*, but the connector would have to process site1, and all of the lists on that site, just to determine that the lists should be ignored. A more efficient configuration is SiteUrlCantHaveRegex=http://sharepoint/site1/, because the connector can immediately determine that nothing from that site has to be processed.

Similarly, you could ignore content on site2, but still index content on site2/SubSite, by configuring ListUrlCantHaveRegex=http://sharepoint/site2/List.*. However, the connector would have to process site2 and all of the lists on that site, just to determine that the lists should be ignored. A more efficient configuration is SiteUrlCantHaveRegex=http://sharepoint/site2/$, because the connector can immediately determine that nothing from site2 has to be processed. The URL for site2/SubSite does not match the regular expression http://sharepoint/site2/$, so content from site2/SubSite is still processed.
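
Before you deploy a regular expression, it can help to check it against the URLs in your own structure. The following Python sketch tests the site URLs from the example against two candidate values for SiteUrlCantHaveRegex. It assumes that an expression can match anywhere in a URL, as re.search does; the connector's regular expression engine might differ, and the function name excluded_sites is hypothetical.

```python
import re

# Site URLs from the example structure above.
SITES = [
    "http://sharepoint/site1/",
    "http://sharepoint/site2/",
    "http://sharepoint/site2/SubSite/",
]


def excluded_sites(sites, cant_have):
    """Return the site URLs that match a CantHave regular expression.

    Assumption: the pattern may match anywhere in the URL (re.search).
    """
    return [url for url in sites if re.search(cant_have, url)]


# With the trailing $, only site2 itself is excluded.
print(excluded_sites(SITES, r"http://sharepoint/site2/$"))
# ['http://sharepoint/site2/']

# Without the trailing $, the child site is excluded as well.
print(excluded_sites(SITES, r"http://sharepoint/site2/"))
# ['http://sharepoint/site2/', 'http://sharepoint/site2/SubSite/']
```

Under this assumption, the trailing $ is what keeps site2/SubSite in scope, because it requires the URL to end immediately after site2/.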

Index Content That Does Not Appear in Search Results

In SharePoint, a user can choose whether to allow items from a list or document library to appear in search results. Users can also choose whether to allow publishing pages (a type of list item) to appear in search engine results, using Search Engine Optimization (SEO) settings. In both cases, by default, the connector ignores items that are set not to appear in search results. You can change this behavior:

  • To index a list or document library regardless of whether it appears in search results, set the configuration parameter IgnoreNoCrawl to True.
  • To index publishing pages regardless of SEO settings, set the configuration parameter IgnoreRobotsNoIndex to True.