SitemapUrl

The URL of a sitemap that lists the pages to ingest.

If you set this parameter, the connector ingests only the pages listed in the sitemap. It does not crawl the site by following links. To filter the pages listed in the sitemap, you can set further parameters, including UrlCantHaveRegex and UrlMustHaveRegex.
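
As an illustration, the following fragment sketches a task that ingests only the sitemap pages under a products path and skips PDF files. The task name MyTask and both regular expressions are hypothetical. The fragment assumes the usual FetchTasks layout and that each expression is tested against the full URL, so check the reference for UrlMustHaveRegex and UrlCantHaveRegex before relying on these patterns.

```
[FetchTasks]
Number=1
0=MyTask

[MyTask]
// MyTask and the patterns below are illustrative values, not defaults.
SitemapUrl=http://www.mywebsite.com/sitemap.xml
UrlMustHaveRegex=.*/products/.*
UrlCantHaveRegex=.*\.pdf
```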

TIP: Web Connector can retrieve information in one of the following ways:

  • To start from a URL and follow links to other pages, set the parameter Url.
  • To retrieve the pages contained in a sitemap, set the parameter SitemapUrl. A sitemap is an XML document that some websites use to present web crawlers with a list of pages to retrieve. If the site provides a sitemap, using it is often the best option, because the connector retrieves the pages that the site administrator suggests. This can be easier than crawling the site and choosing the pages to ingest based on their URL or content.
  • To retrieve a list of URLs that are specified in a text file, set the parameter SitemapFile. You must create the file yourself, which is impractical for large sites, but this option is useful if an external process generates the URLs.

The connector uses only one of these parameters. If you set more than one, SitemapUrl takes precedence, followed by SitemapFile, and then Url; the others are ignored.
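
For example, in a hypothetical task that sets both Url and SitemapUrl, the precedence order above means that only the sitemap is used. The task name and URLs are illustrative.

```
[MyTask]
// Ignored, because SitemapUrl is also set.
Url=http://www.mywebsite.com/
// Takes precedence: only the pages listed in this sitemap are retrieved.
SitemapUrl=http://www.mywebsite.com/sitemap.xml
```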

You can also set this parameter to the URL of a sitemap index (a list of sitemaps). If you do so, you can choose which of the sitemaps to process by setting the configuration parameters SitemapIndexUrlCantHaveRegex and SitemapIndexUrlMustHaveRegex.
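
A minimal sketch of this, assuming a hypothetical sitemap index at sitemap_index.xml that includes an archive sitemap you want to skip. The index URL and the pattern are illustrative, and the pattern assumes that the expression is tested against the full URL of each sitemap in the index.

```
[MyTask]
// The index URL and the pattern are illustrative assumptions.
SitemapUrl=http://www.mywebsite.com/sitemap_index.xml
SitemapIndexUrlCantHaveRegex=.*archive.*
```

Under those assumptions, any sitemap in the index whose URL contains "archive" would be excluded, and the remaining sitemaps would be processed.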

Type: String
Default:  
Required: You must set Url, SitemapUrl, or SitemapFile
Configuration Section: TaskName or FetchTasks or Default
Example: SitemapUrl=http://www.mywebsite.com/sitemap.xml
See Also:

  • IgnoreSitemapScopeErrors
  • Url
  • SitemapFile