SitemapFile

The path to a plain text file that contains a list of pages to ingest. The file must contain one URL on each line.
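For example, the file might contain the following list (the URLs shown are placeholders):

  http://www.example.com/
  http://www.example.com/products.html
  http://www.example.com/contact.html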

The connector ingests only the pages that are listed in the file; it does not crawl the web by following links. You can set further parameters, including UrlCantHaveRegex and UrlMustHaveRegex, to filter the URLs contained in the file. The connector respects the robots exclusion protocol by default, but you can change this behavior by setting the parameter FollowRobotProtocol.
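For example, the following task configuration is a minimal sketch; the section name, file name, and regular expressions are hypothetical:

  [MyTask]
  SitemapFile=my-list-of-urls.txt
  UrlMustHaveRegex=.*\.html
  UrlCantHaveRegex=.*/archive/.*

In this sketch, the connector reads the URLs from my-list-of-urls.txt and ingests only those that end in .html and do not contain /archive/.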

The file can be generated manually or by an external process, and can be updated. If a URL is removed from the file, the connector sends an ingest-delete for that page on the next synchronize cycle.

TIP: The Web Connector can retrieve information in one of the following ways:

  • To start from a URL and follow links to other pages, set the parameter Url.
  • To retrieve the pages contained in a sitemap, set the parameter SitemapUrl. A sitemap is an XML document used by some websites to present web crawlers with a list of pages to retrieve. Using a sitemap, if there is one, is often the best option because the connector retrieves the pages suggested by the site administrator. This can be easier than crawling the site and choosing the pages to ingest based on their URL or content.
  • To retrieve a list of URLs that are specified in a text file, set the parameter SitemapFile. You must create the file yourself, which is not practical for large sites, but this option is useful when an external process generates the URLs.

If you set more than one of these parameters, the connector uses only one and ignores the others: SitemapUrl takes precedence, followed by SitemapFile, and then Url.
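For example, in the following sketch (the section name and values are hypothetical), the connector retrieves the pages listed in the sitemap and ignores the Url parameter:

  [MyTask]
  SitemapUrl=http://www.example.com/sitemap.xml
  Url=http://www.example.com/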

Type: String
Default:  
Required: You must set Url, SitemapUrl, or SitemapFile
Configuration Section: TaskName or FetchTasks
Example: SitemapFile=my-list-of-urls.txt
See Also:
Url
SitemapUrl