Retrieve Recently Updated Content

In some cases you might want to retrieve information from the Web only when it has been recently updated. For example, if you are crawling a news site, you might want to restrict the synchronize task to retrieve pages that were updated in the last 30 days.

To retrieve Web pages based on the date, the Web Connector must be able to extract a date for each page. The connector checks the following sources (in order) and uses the first valid date that it finds:

  • The URL of the page (but only if DateInUrl is set to TRUE).
  • The page content (but only if you set PageDateSelector).
  • The HTTP headers returned with the page. The connector attempts to extract a date from the headers named by the parameter PageDateHeader.
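
The source order above can be sketched in a few lines of Python. This is only an illustration of the documented precedence, not the connector's actual implementation: the function name resolve_page_date and its arguments are hypothetical, and the candidate dates are assumed to have been extracted already.

```python
from datetime import datetime
from typing import Optional, Sequence


def resolve_page_date(
    url_date: Optional[datetime],
    content_date: Optional[datetime],
    header_dates: Sequence[Optional[datetime]],
    date_in_url: bool = False,
    page_date_selector_set: bool = False,
) -> Optional[datetime]:
    """Return the first valid date, following the documented source order.

    All names here are hypothetical; they only mirror the roles of
    DateInUrl, PageDateSelector, and PageDateHeader.
    """
    # 1. The URL, but only when DateInUrl is TRUE.
    if date_in_url and url_date is not None:
        return url_date
    # 2. The page content, but only when PageDateSelector is set.
    if page_date_selector_set and content_date is not None:
        return content_date
    # 3. The HTTP headers, in the order given by PageDateHeader.
    for header_date in header_dates:
        if header_date is not None:
            return header_date
    # No source produced a valid date.
    return None
```

With DateInUrl disabled, a date in the URL is ignored even if one is present, and the first header in the list that yields a valid date is used only when the URL and the page content produce nothing.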

To retrieve pages based on the date

  1. Stop the connector and open the configuration file.

  2. Modify your fetch task by adding the relevant parameters from the following list, so that the connector can extract the date associated with a page:

    DateInUrl A Boolean value that specifies whether the date associated with a Web page can be extracted from the page URL. When you set this parameter to TRUE, the connector attempts to extract the date from the URL.
    PageDateSelector A list of CSS selectors to identify elements in the page content that might contain the date associated with a Web page.
      • If the date is contained in the element's content, setting this parameter is sufficient to extract the date.
      • If the date is contained in one of the element's attributes, set this parameter and then set PageDateAttribute to identify the attribute.
    PageDateAttribute A list of attributes (on elements listed by PageDateSelector) that contain the date associated with the Web page.
    PageDateHeader A list of headers to retrieve the date from. The headers are checked in the order that you specify and the first valid date that the connector finds is associated with the page. The connector only attempts to extract a date from HTTP headers if no date is found in the page URL (or DateInUrl=FALSE) and no date is found in the page content (or PageDateSelector is not set).
    DateFormats A comma-separated list of date formats to use when searching for a date.
  3. Set one or both of the following parameters.

    MinPageDate Filters the pages that are ingested by date. The connector only ingests pages that are newer than the specified date.
    MaxPageDate Filters the pages that are ingested by date. The connector only ingests pages that are older than the specified date.

    You can configure a fixed limit, for example to retrieve pages that were updated after 01 July 2015. Alternatively, you can configure a rolling limit, for example MinPageDate=-30 days to retrieve pages that are not more than 30 days old when the synchronize task starts.

  4. (Optional) By default the connector ingests pages for which it cannot determine a date. To prevent the connector from ingesting these pages, set the parameter IngestPagesWithNoDate to False.
  5. (Optional) By default the connector still follows links from pages that are outside your specified date range, even though it does not ingest those pages. To prevent the connector from following links from these pages, set the parameter SpiderDateFilteredPages to False.
  6. Save and close the configuration file.
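
The way the parameters in steps 3 to 5 combine can be sketched in Python. This is a hypothetical model, not the connector's code: the names rolling_cutoff, decide, and Decision are invented for illustration, the comparisons are assumed to be strict ("newer than", "older than"), and link following for pages with no date is an assumption because the text above does not describe it.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Decision(NamedTuple):
    ingest: bool        # whether the page is ingested
    follow_links: bool  # whether links on the page are followed


def rolling_cutoff(task_start: datetime, days: int) -> datetime:
    """Rolling limit: a cutoff measured back from the synchronize task start."""
    return task_start - timedelta(days=days)


def decide(
    page_date: Optional[datetime],
    min_page_date: Optional[datetime] = None,
    max_page_date: Optional[datetime] = None,
    ingest_pages_with_no_date: bool = True,
    spider_date_filtered_pages: bool = True,
) -> Decision:
    """Model the roles of MinPageDate, MaxPageDate, IngestPagesWithNoDate,
    and SpiderDateFilteredPages (all defaults follow the text above)."""
    if page_date is None:
        # Assumption: links on undated pages are still followed.
        return Decision(ingest_pages_with_no_date, True)
    newer_than_min = min_page_date is None or page_date > min_page_date
    older_than_max = max_page_date is None or page_date < max_page_date
    if newer_than_min and older_than_max:
        return Decision(True, True)
    # Outside the date range: never ingested, crawled only if allowed.
    return Decision(False, spider_date_filtered_pages)
```

For a task that starts on 31 July 2015, rolling_cutoff(datetime(2015, 7, 31), 30) gives 01 July 2015, so under these assumptions a page dated 15 July is ingested while a page dated 01 June is skipped and, with spider_date_filtered_pages set to False, its links are not followed.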

Example

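The following example configuration retrieves pages that are less than 30 days old, based on a date contained in the URL or in the page content. It does not ingest pages for which no date can be determined, and it does not follow links from pages that are outside the date range: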

[MyTask]
Url=http://www.example.com/
...
DateInUrl=True
PageDateSelector=p[id=date]
MinPageDate=-30 days
IngestPagesWithNoDate=False
SpiderDateFilteredPages=False