Retrieve Information using a Sitemap

This section describes how to retrieve content from a website by fetching the URLs listed in a sitemap. If you configure the connector to use a sitemap, the connector ingests only the pages that the sitemap lists and does not follow links between pages.

To create a new Fetch Task

  1. Stop the connector.
  2. Open the configuration file in a text editor.
  3. In the [FetchTasks] section of the configuration file, specify the number of fetch tasks using the Number parameter. If you are configuring the first fetch task, type Number=1. If one or more fetch tasks have already been configured, increase the value of the Number parameter by one (1). Below the Number parameter, specify the names of the fetch tasks, starting from zero (0). For example:

    [FetchTasks]
    Number=1
    0=MyTask
  4. Below the [FetchTasks] section, create a new TaskName section. The name of the section must match the name of the new fetch task. For example:

    [FetchTasks]
    Number=1
    0=MyTask
    
    [MyTask]
  5. Choose whether to provide the URL of the sitemap, or to configure the connector to find the sitemap automatically by reading the robots.txt file.

    • Specify the URL of the sitemap:

      SitemapUrl: The URL of a sitemap that lists the pages to ingest. If you set this parameter, only the pages on the sitemap are ingested. The connector does not crawl the site by following links.

    • Configure the connector to find the sitemap(s) from robots.txt:

      UseSitemapFromRobots: To configure the connector to look in robots.txt for the URL of the sitemap, set this parameter to TRUE. If the connector cannot find the location of the sitemap in robots.txt, it falls back to crawling the site from the specified URL.
      Url: The URL of the site. Specify exactly one URL.

    For example:

    [MyTask]
    UseSitemapFromRobots=TRUE
    Url=https://www.hpe.com/

  6. (Optional) To restrict which pages are ingested, set any of the following configuration parameters:

    PageCantHaveRegex: A regular expression that excludes pages by content. If the content of a page matches the regular expression, the page is not ingested. This parameter applies to all file types.
    PageMustHaveRegex: A regular expression that includes pages by content. The content of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types.
    UrlCantHaveRegex: A regular expression that excludes pages by URL. If the full URL of a page matches the regular expression, the page is not ingested. This parameter applies to all file types, for example HTML pages, images, and text documents.
    UrlMustHaveRegex: A regular expression that includes pages by URL. The full URL of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types, for example HTML pages, images, and text documents.

  7. (Optional) If the connector is installed on a machine that is behind a proxy server, see Retrieve Information through a Proxy Server.

  8. Save and close the configuration file.
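
The steps above can be combined into a single configuration. The following is a minimal sketch under stated assumptions, not a definitive setup: the task name MySitemapTask, the sitemap URL on www.example.com, and both regular expressions are hypothetical values chosen for illustration. The patterns are written to match the whole URL, in line with the descriptions of UrlMustHaveRegex and UrlCantHaveRegex. The lines that begin with // are annotations and can be removed.

    [FetchTasks]
    Number=1
    0=MySitemapTask

    [MySitemapTask]
    // Hypothetical sitemap location. Only pages listed in it are ingested.
    SitemapUrl=https://www.example.com/sitemap.xml
    // Hypothetical filter: keep only URLs under /products/.
    UrlMustHaveRegex=.*/products/.*
    // Hypothetical filter: skip PDF files.
    UrlCantHaveRegex=.*\.pdf

In this sketch the two URL filters work together: a page from the sitemap is ingested only if its URL matches UrlMustHaveRegex and does not match UrlCantHaveRegex. Adjust or remove the filters to suit the site you are retrieving.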