Retrieve Recently Updated Content

In some cases you might want to retrieve information from the Web only when it has been recently updated. For example, if you are crawling a news site, you might want to restrict the synchronize task to retrieve pages that were updated in the last 30 days.

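To retrieve Web pages based on the date, the HPE Web Connector must be able to extract a date for each page. The connector checks the following sources (in order) and uses the first valid date that it finds:

  1. The page URL (when DateInUrl is set to TRUE).
  2. The page content (when PageDateSelector is set).
  3. The HTTP headers (those listed by PageDateHeader).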

To retrieve pages based on the date

  1. Stop the connector and open the configuration file.

  2. Modify your fetch task by adding the relevant parameters from the following list, so that the connector can extract the date associated with a page:

    DateInUrl
        A Boolean value that specifies whether the date associated with a Web page can be extracted from the page URL. When you set this parameter to TRUE, the connector attempts to extract the date from the URL.

    PageDateSelector
        A list of CSS2 selectors to identify elements in the page content that might contain the date associated with a Web page.
        • If the date is contained in the element's content, setting this parameter is sufficient to extract the date.
        • If the date is contained in one of the element's attributes, set this parameter and then set PageDateAttribute to identify the attribute.

    PageDateAttribute
        A list of attributes (on elements listed by PageDateSelector) that contain the date associated with the Web page.

    PageDateHeader
        A list of headers to retrieve the date from. The headers are checked in the order that you specify and the first valid date that the connector finds is associated with the page. The connector only attempts to extract a date from HTTP headers if no date is found in the page URL (or DateInUrl=FALSE) and no date is found in the page content (or PageDateSelector is not set).

    DateFormats
        A comma-separated list of date formats to use when searching for a date.

  3. Choose which pages to retrieve. For example, to retrieve only pages that were updated in the last 30 days, set MaxPageAge=30 days.

  4. (Optional) By default, the connector ingests pages for which it cannot determine a date. To prevent the connector from ingesting these pages, set the parameter IngestPagesWithNoDate to False.
  5. (Optional) By default, the connector still follows links from pages that are outside your specified date range, even though it does not ingest those pages. To prevent the connector from following links from these pages, set the parameter SpiderDateFilteredPages to False.
  6. Save and close the configuration file.

Example

The following example configuration retrieves pages that are less than 30 days old, based on a date contained in the URL or in the page content:

[MyTask]
Url=http://www.autonomy.com
...
DateInUrl=true
PageDateSelector=p[id=date]
MaxPageAge=30 days
IngestPagesWithNoDate=False
SpiderDateFilteredPages=False
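
As a further illustration, the following sketch shows how the parameters might be combined when the date is held in an element attribute rather than in the URL, with an HTTP header as a fallback. The selector meta[name=date], the attribute content, the header Last-Modified, and the DateFormats values are hypothetical and depend on the site that you crawl. Check the date format strings against the connector's parameter reference before you use them.

```ini
// Illustrative sketch only. The selector, attribute, header, and
// date format values are assumptions, not values from this guide.
[MyTask]
Url=http://www.autonomy.com
...
// Do not look for a date in the URL.
DateInUrl=False
// Look for an element such as <meta name="date" content="...">
// and read the date from its content attribute.
PageDateSelector=meta[name=date]
PageDateAttribute=content
// If the page content yields no date, fall back to this HTTP header.
PageDateHeader=Last-Modified
// Assumed format strings; adjust to match the dates on the site.
DateFormats=YYYY-MM-DD,DD/MM/YYYY
MaxPageAge=30 days
IngestPagesWithNoDate=False
```

With these settings the connector would try the page content first and then the header, because the URL check is disabled, and would not ingest a page for which neither source yields a valid date.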
