This section describes how to retrieve content from a Web site by crawling a site (following links from one page to another).
To create a new Fetch Task
In the [FetchTasks] section of the configuration file, specify the number of fetch tasks using the Number parameter. If you are configuring the first fetch task, type Number=1. If one or more fetch tasks have already been configured, increase the value of the Number parameter by one (1). Below the Number parameter, specify the names of the fetch tasks, starting from zero (0). For example:

   [FetchTasks]
   Number=1
   0=MyTask
Below the [FetchTasks] section, create a new TaskName section. The name of the section must match the name of the new fetch task. For example:

   [FetchTasks]
   Number=1
   0=MyTask

   [MyTask]
In the new section, set the following parameters.
Url
   The URL of the page to start crawling from. You can specify multiple URLs to start crawling from by setting a comma-separated list of values, or by using a numbered list of parameters (Url0=, Url1=, and so on).
Depth
   (Optional) The maximum depth to which the connector follows links when crawling. For example, to index all pages that can be reached from the Url by following no more than three links, set Depth=3. The default value of this parameter (Depth=-1) specifies no limit.
StayOnSite
   (Optional) A Boolean value that specifies whether the connector stays on the Web site identified by the Url parameter.
SpiderUrlCantHaveRegex
   (Optional) A regular expression to restrict the pages that are crawled by the connector. If the full URL of a page matches the regular expression, the page is not downloaded, crawled, or ingested.
SpiderUrlMustHaveRegex
   (Optional) A regular expression to restrict the pages that are crawled by the connector. The full URL of a page must match the regular expression, otherwise it is not downloaded, crawled, or ingested.
For example:
   [MyTask]
   Url=http://www.autonomy.com
   StayOnSite=true
   SpiderUrlCantHaveRegex=.*subdomain\.autonomy\.com.*
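A task can also start crawling from several pages at once and limit how far it follows links. The following sketch uses hypothetical URLs to illustrate the numbered Url parameters together with Depth:

   [MyTask]
   Url0=http://www.example.com/products
   Url1=http://www.example.com/support
   Depth=3
   StayOnSite=true

With this configuration the connector starts from both pages, follows links no more than three deep, and stays on the site.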
(Optional) If you want to index only specific pages, you can use the following configuration parameters:
PageCantHaveRegex
   A regular expression to restrict the content retrieved by the connector. If the content of a page matches the regular expression, the page is not ingested. This parameter applies to all file types.
PageMustHaveRegex
   A regular expression to restrict the content retrieved by the connector. The content of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types.
UrlCantHaveRegex
   A regular expression to restrict the content retrieved by the connector. If the full URL of a page matches the regular expression, the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.
UrlMustHaveRegex
   A regular expression to restrict the content retrieved by the connector. The full URL of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.
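For example, to ingest only pages under a particular path whose content contains a particular phrase, you could combine the must-have parameters. The following sketch uses hypothetical values:

   [MyTask]
   Url=http://www.example.com
   UrlMustHaveRegex=.*/docs/.*
   PageMustHaveRegex=.*[Rr]elease [Nn]otes.*

Unlike the Spider parameters, these parameters affect only ingestion: pages that do not match are still crawled for links, but their content is not ingested.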
(Optional) If the connector is installed on a machine that is behind a proxy server, see Retrieve Information through a Proxy Server.