This section describes how to retrieve content from a Web site by retrieving URLs contained in a sitemap. If you configure the connector to use a sitemap, the connector does not follow links between pages.
To create a new Fetch Task
In the [FetchTasks] section of the configuration file, specify the number of fetch tasks using the Number parameter. If you are configuring the first fetch task, type Number=1. If one or more fetch tasks have already been configured, increase the value of the Number parameter by one (1). Below the Number parameter, specify the names of the fetch tasks, starting from zero (0). For example:
[FetchTasks]
Number=1
0=MyTask
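As a sketch of the other case, suppose MyTask is already configured and a second task is being added. The name MySecondTask is hypothetical and used only for illustration. Number increases from 1 to 2, and the new task takes the next index:

```ini
[FetchTasks]
// MySecondTask is a hypothetical name for the newly added task.
Number=2
0=MyTask
1=MySecondTask
```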
Below the [FetchTasks] section, create a new TaskName section. The name of the section must match the name of the new fetch task. For example:
[FetchTasks]
Number=1
0=MyTask

[MyTask]
Choose whether to provide the URL of the sitemap, or configure the connector to automatically find the sitemap by reading the robots.txt file.
Specify the URL of the sitemap:
SitemapUrl | The URL of a sitemap that lists the pages to ingest. If you set this parameter, only the pages on the sitemap are ingested. The connector does not crawl the site by following links.
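A minimal sketch of this option, assuming the task section is named MyTask as in the earlier examples. The sitemap address is a placeholder, not a real location:

```ini
[MyTask]
// Hypothetical sitemap URL; replace it with the address of your own sitemap.
SitemapUrl=https://www.example.com/sitemap.xml
```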
Configure the connector to find the sitemap(s) from robots.txt:
UseSitemapFromRobots | To configure the connector to look in robots.txt for the URL of the sitemap, set this parameter to TRUE. If the connector cannot find the location of the sitemap in robots.txt, it falls back to crawling the site from the specified URL.
Url | The URL of the site. You must specify only one URL.
For example:
[MyTask]
UseSitemapFromRobots=TRUE
Url=https://www.hpe.com
(Optional) If you want to index specific pages, you can use the following configuration parameters:
PageCantHaveRegex | A regular expression to restrict the content retrieved by the connector. If the content of a page matches the regular expression, the page is not ingested. This parameter applies to all file types.
PageMustHaveRegex | A regular expression to restrict the content retrieved by the connector. The content of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types.
UrlCantHaveRegex | A regular expression to restrict the content retrieved by the connector. If the full URL of a page matches the regular expression, the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.
UrlMustHaveRegex | A regular expression to restrict the content retrieved by the connector. The full URL of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.
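The URL filters might be combined with the sitemap settings as in the sketch below. The /support/ and /archive/ path segments are hypothetical. Because the descriptions above say the full URL must match, each pattern is wrapped in .* so that it covers the whole URL:

```ini
[MyTask]
UseSitemapFromRobots=TRUE
Url=https://www.hpe.com
// Hypothetical filters: ingest only URLs that contain /support/ ...
UrlMustHaveRegex=.*/support/.*
// ... and skip any URL that contains /archive/.
UrlCantHaveRegex=.*/archive/.*
```

With these values, a sitemap entry is ingested only if its URL passes both filters. The PageMustHaveRegex and PageCantHaveRegex parameters apply the same idea to page content rather than to the URL.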
(Optional) If the connector is installed on a machine that is behind a proxy server, see Retrieve Information through a Proxy Server.