Retrieve Information by Crawling the Web

This section describes how to retrieve content from a Web site by crawling it (following links from one page to another).

TIP: IDOL Expert includes a full step-by-step guide that demonstrates how to ingest information from a website. The IDOL Expert guide discusses strategies for selecting the content to crawl and ingest, and includes an example configuration that you can run.

To create a new Fetch Task

  1. Stop the connector.
  2. Open the configuration file in a text editor.
  3. In the [FetchTasks] section of the configuration file, specify the number of fetch tasks using the Number parameter. If you are configuring the first fetch task, type Number=1. If one or more fetch tasks have already been configured, increase the value of the Number parameter by one (1). Below the Number parameter, specify the names of the fetch tasks, starting from zero (0). For example:

    [FetchTasks]
    Number=1
    0=MyTask
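
    If one or more fetch tasks already exist, increase the value of Number and add the new task name to the numbered list. For example, assuming an existing task named MyTask and a new task named MyOtherTask:

    [FetchTasks]
    Number=2
    0=MyTask
    1=MyOtherTask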
  4. Below the [FetchTasks] section, create a new TaskName section. The name of the section must match the name of the new fetch task. For example:

    [FetchTasks]
    Number=1
    0=MyTask
    
    [MyTask]
  5. In the new section, set the following parameters.

    Url The URL of the page to start crawling from. You can specify multiple URLs to start crawling from by setting a comma-separated list of values or using a numbered list of parameters (Url0=, Url1=, and so on).
    Depth (Optional) The maximum depth to which the connector follows links when crawling. For example, to index all pages that can be reached from the Url by following no more than three links, set Depth=3. The default value of this parameter (Depth=-1) specifies no limit.
    StayOnSite (Optional) A Boolean value that specifies whether the connector stays on the Web site identified by the Url parameter. To allow the connector to follow links to other Web sites, set this parameter to false.

    SpiderUrlCantHaveRegex (Optional) A regular expression to restrict the pages that are crawled by the connector. If the full URL of a page matches the regular expression, the page is not downloaded, crawled, or ingested.
    SpiderUrlMustHaveRegex (Optional) A regular expression to restrict the pages that are crawled by the connector. The full URL of a page must match the regular expression, otherwise it is not downloaded, crawled, or ingested.

    For example:

    [MyTask]
    Url=http://www.example.com/
    StayOnSite=true
    SpiderUrlCantHaveRegex=.*subdomain\.example\.com.*
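    The following sketch shows a task that starts crawling from two pages, using the numbered form of the Url parameter, and follows links no more than three deep. The URLs are placeholders; substitute your own start pages:

    [MyTask]
    Url0=http://www.example.com/
    Url1=http://www.example.com/news/
    Depth=3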
  6. (Optional) To restrict which pages are ingested, you can use the following configuration parameters. Unlike the SpiderUrl parameters above, these parameters do not prevent the connector from crawling a page; they only prevent the page from being ingested:

    PageCantHaveRegex A regular expression to restrict the content retrieved by the connector. If the content of a page matches the regular expression, the page is not ingested. This parameter applies to all file types.
    PageMustHaveRegex A regular expression to restrict the content retrieved by the connector. The content of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types.
    UrlCantHaveRegex A regular expression to restrict the content retrieved by the connector. If the full URL of a page matches the regular expression, the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.

    UrlMustHaveRegex A regular expression to restrict the content retrieved by the connector. The full URL of a page must match the regular expression, otherwise the page is not ingested. This parameter applies to all file types, for example HTML pages, images, text documents, and so on.
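
    For example, the following sketch ingests only pages whose URL falls under a hypothetical /docs/ path and skips any page whose content matches a draft marker. The connector still crawls from the start page; only ingestion is restricted. The URL and regular expressions are placeholders, written in the same style as the example above:

    [MyTask]
    Url=http://www.example.com/
    UrlMustHaveRegex=.*example\.com/docs/.*
    PageCantHaveRegex=.*DRAFT.*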
  7. (Optional) If the connector is installed on a machine that is behind a proxy server, see Retrieve Information through a Proxy Server.

  8. Save and close the configuration file.
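
Putting the pieces together, a minimal complete configuration for a single crawling task might look like the following. The URL and regular expression are placeholders:

    [FetchTasks]
    Number=1
    0=MyTask

    [MyTask]
    Url=http://www.example.com/
    Depth=3
    StayOnSite=true
    SpiderUrlCantHaveRegex=.*login.*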