To retrieve content from a Web site, create a new fetch task. The connector runs fetch tasks automatically, based on the schedule that is defined in the configuration file.
Tip: For a complete list of the parameters that you can use to configure a fetch task, refer to the Web Connector Reference.
To create a new fetch task
1. In the [FetchTasks] section of the configuration file, specify the number of fetch tasks using the Number parameter. If you are configuring the first fetch task, type Number=1. If one or more fetch tasks have already been configured, increase the value of the Number parameter by one (1). Below the Number parameter, specify the names of the fetch tasks, starting from zero (0). For example:

[FetchTasks]
Number=1
0=MyTask
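For instance, if you later add a second fetch task, you might update the section as follows (the second task name is illustrative):

[FetchTasks]
Number=2
0=MyTask
1=MyOtherTask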
2. Below the [FetchTasks] section, create a new TaskName section. The name of the section must match the name of the new fetch task. For example:

[FetchTasks]
Number=1
0=MyTask

[MyTask]
3. In the new section, set one of the following parameters.

Url | The URL of the page to start crawling from. If you set this parameter, the connector crawls the Web site. You can specify multiple URLs to start crawling from by setting a comma-separated list of values or using a numbered list of parameters (Url0=, Url1=, and so on).
SitemapUrl | The URL of a sitemap that lists the pages to ingest. If you set this parameter, only the pages on the sitemap are ingested. The connector does not crawl the site by following links. If you set this parameter, Url is ignored.
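For example, a task that ingests only the pages listed in a sitemap might look like the following (the sitemap URL is illustrative):

[MyTask]
SitemapUrl=http://www.example.com/sitemap.xml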
4. (Optional) If you set the Url parameter, you can configure the way the connector crawls the site by setting the following parameters:

Depth | The maximum depth to which the connector follows links when crawling. For example, to index all pages that can be reached from the Url by following no more than three links, set Depth=3. The default value of this parameter (Depth=-1) specifies no limit.
StayOnSite | A Boolean value that specifies whether the connector stays on the Web site identified by the Url parameter. To prevent the connector from following links to other Web sites, set StayOnSite=true.
SpiderUrlCantHaveRegex | A regular expression to restrict the pages that are crawled by the connector. If the full URL of a page matches the regular expression, the page is not crawled and is not ingested.
SpiderUrlMustHaveRegex | A regular expression to restrict the pages that are crawled by the connector. The full URL of a page must match the regular expression, otherwise it is not crawled and is not ingested.
For example:

[MyTask]
Url=http://www.autonomy.com
StayOnSite=true
SpiderUrlCantHaveRegex=.*subdomain\.autonomy\.com.*
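Similarly, to stop the connector after it has followed no more than two links from the start page, you might add the Depth parameter (a minimal sketch using the same illustrative settings):

[MyTask]
Url=http://www.autonomy.com
StayOnSite=true
Depth=2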
5. (Optional) If you want to index specific pages, you can use the following configuration parameters:

PageCantHaveRegex | A regular expression to restrict the content retrieved by the connector. If the content of a page matches the regular expression, the page is not ingested. This parameter applies to all file types.
PageMustHaveRegex | A regular expression to restrict the content retrieved by the connector. The content of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types.
UrlCantHaveRegex | A regular expression to restrict the content retrieved by the connector. If the full URL of a page matches the regular expression, the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.
UrlMustHaveRegex | A regular expression to restrict the content retrieved by the connector. The full URL of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.
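For example, to ingest only pages whose URLs end in .html and to exclude any page whose content matches the word "archive", a task might include settings such as the following (the regular expressions are illustrative):

[MyTask]
Url=http://www.autonomy.com
UrlMustHaveRegex=.*\.html
PageCantHaveRegex=.*archive.*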
6. (Optional) If the connector is installed on a machine that is behind a proxy server, see Retrieve Information through a Proxy Server.