Choose the Content to Index

When you configure a fetch task to retrieve information from the Web, you can exclude pages from being downloaded and crawled, and you can separately exclude pages from being ingested.

There is an important difference between these two kinds of exclusion. A page can be crawled (the connector follows the links on the page) without being ingested.

For example, consider the following site structure:

index.html
 |- products/software.html
 |    |- products/software/product1.html
 |    |- products/software/product2.html
 |    |- products/software/product3.html
 |
 |- products/hardware.html
      |- products/hardware/hardware1.html
      |- products/hardware/hardware2.html
      |- products/hardware/hardware3.html

If you set SpiderUrlCantHaveRegex=.*software\.html, the connector does not download or crawl the page products/software.html, so the links to the pages product1.html, product2.html, and product3.html are not followed. The pages marked "not ingested" below are therefore not ingested:

index.html
 |- products/software.html                   <-- not ingested
 |    |- products/software/product1.html     <-- not ingested
 |    |- products/software/product2.html     <-- not ingested
 |    |- products/software/product3.html     <-- not ingested
 |
 |- products/hardware.html
      |- products/hardware/hardware1.html
      |- products/hardware/hardware2.html
      |- products/hardware/hardware3.html

Tip: Site structures are usually more complex than the example shown here. For example, if the page hardware1.html contained a link to product1.html, the connector would still crawl and ingest product1.html, because it can reach that page without going through products/software.html.

Alternatively, if you set UrlCantHaveRegex=.*software\.html, the connector does not ingest the page products/software.html, but the page is still crawled for links. The pages product1.html, product2.html, and product3.html do not match the regular expression and so they are still ingested. Only the single page marked "not ingested" below is excluded from being ingested:

index.html
 |- products/software.html                   <-- not ingested
 |    |- products/software/product1.html
 |    |- products/software/product2.html
 |    |- products/software/product3.html
 |
 |- products/hardware.html
      |- products/hardware/hardware1.html
      |- products/hardware/hardware2.html
      |- products/hardware/hardware3.html
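
The difference between the two settings can be sketched with a small simulation. This is a toy model and not the connector's implementation: the SITE link graph, the simulate function, and its argument names are hypothetical, and the sketch assumes that a regular expression must match the whole URL.

```python
import re
from collections import deque

# Hypothetical link graph for the example site (page -> pages it links to).
SITE = {
    "index.html": ["products/software.html", "products/hardware.html"],
    "products/software.html": [
        "products/software/product1.html",
        "products/software/product2.html",
        "products/software/product3.html",
    ],
    "products/hardware.html": [
        "products/hardware/hardware1.html",
        "products/hardware/hardware2.html",
        "products/hardware/hardware3.html",
    ],
}


def simulate(site, start, spider_cant_have=None, cant_have=None):
    """Return the sorted pages that would be ingested in this toy model."""
    spider_re = re.compile(spider_cant_have) if spider_cant_have else None
    ingest_re = re.compile(cant_have) if cant_have else None
    seen = {start}
    queue = deque([start])
    ingested = []
    while queue:
        url = queue.popleft()
        # Spider exclusion: the page is not downloaded, so its links are never seen.
        if spider_re and spider_re.fullmatch(url):
            continue
        # Ingest exclusion: the page is not ingested, but its links are still followed.
        if not (ingest_re and ingest_re.fullmatch(url)):
            ingested.append(url)
        for link in site.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(ingested)


spider_only = simulate(SITE, "index.html", spider_cant_have=r".*software\.html")
ingest_only = simulate(SITE, "index.html", cant_have=r".*software\.html")
print(len(spider_only), len(ingest_only))
```

In this model the first setting leaves five pages ingested (index.html, products/hardware.html, and the three hardware pages), and the second leaves eight (every page except products/software.html). Adding a link from hardware1.html to product1.html in SITE reproduces the case described in the Tip: product1.html is then ingested even with the spider exclusion in place.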
