Retrieve Recently Updated Content
In some cases you might want to retrieve information from the Web only when it has been recently updated. For example, if you are crawling a news site, you might want to restrict the synchronize task to retrieve pages that were updated in the last 30 days.
To retrieve Web pages based on the date, the Web Connector must be able to extract a date for each page. The connector checks the following sources (in order) and uses the first valid date that it finds:
- The URL of the page (but only if DateInUrl is set to TRUE).
- The page content (but only if you set PageDateSelector).
- The HTTP headers returned with the page. The connector attempts to extract a date from the headers named by the parameter PageDateHeader.
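As a rough illustration of this order, the following fragment enables all three sources in a single fetch task. The values are illustrative assumptions: the task name [MyTask] and the selector p[id=date] are borrowed from the example at the end of this section, and Last-Modified is a standard HTTP response header chosen here only as a plausible value for PageDateHeader. The // lines are annotations for the reader; remove them if your configuration file does not accept that comment style.

```ini
[MyTask]
// 1. Checked first: a date in the page URL
DateInUrl=TRUE
// 2. Checked next: a date in the content of a matching element
PageDateSelector=p[id=date]
// 3. Checked last: a date in the named HTTP header (assumed value)
PageDateHeader=Last-Modified
```

With a configuration like this, a page whose URL contains a valid date would never reach the selector or header checks, because the connector uses the first valid date that it finds.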
To retrieve pages based on the date
- Stop the connector and open the configuration file.
- Modify your fetch task by adding the relevant parameters from the following list, so that the connector can extract the date associated with a page:
DateInUrl
A Boolean value that specifies whether the date associated with a Web page can be extracted from the page URL. When you set this parameter to TRUE, the connector attempts to extract the date from the URL.
PageDateSelector
A list of CSS selectors to identify elements in the page content that might contain the date associated with a Web page.
- If the date is contained in the element's content, setting this parameter is sufficient to extract the date.
- If the date is contained in one of the element's attributes, set this parameter and then set PageDateAttribute to identify the attribute.
PageDateAttribute
A list of attributes (on elements listed by PageDateSelector) that contain the date associated with the Web page.
PageDateHeader
A list of headers to retrieve the date from. The headers are checked in the order that you specify and the first valid date that the connector finds is associated with the page. The connector only attempts to extract a date from HTTP headers if no date is found in the page URL (or DateInUrl=FALSE) and no date is found in the page content (or PageDateSelector is not set).
DateFormats
A comma-separated list of date formats to use when searching for a date.
- Set one or both of the following parameters.
MinPageDate
Filters the pages that are ingested by date. The connector only ingests pages that are newer than the specified date.
MaxPageDate
Filters the pages that are ingested by date. The connector only ingests pages that are older than the specified date.
You can configure a fixed limit, for example to retrieve pages that were updated after 01 July 2015. Alternatively, you can configure a rolling limit, for example to retrieve pages that are not more than 30 days old when the synchronize task starts.
- (Optional) By default the connector ingests pages where it cannot determine a date. To prevent the connector from ingesting these pages, set the parameter IngestPagesWithNoDate to False.
- (Optional) By default the connector still crawls pages that are outside your specified date range, even though they are not ingested. To prevent the connector from following links from these pages, set the parameter SpiderDateFilteredPages to False.
- Save and close the configuration file.
Example
The following example configuration retrieves pages that are less than 30 days old, based on a date contained in the URL or in the page content:
[MyTask]
Url=http://www.example.com/
...
DateInUrl=true
PageDateSelector=p[id=date]
MinPageDate=-30 days
IngestPagesWithNoDate=False
SpiderDateFilteredPages=False
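When the date is stored in an element attribute instead of the element's text, a variant of this configuration might look like the sketch below. It assumes, hypothetically, that the site marks up dates with time elements that carry a datetime attribute. The selector time and the attribute name datetime are standard HTML, but they are assumptions about the site being crawled, not values confirmed for any particular site. The // line is an annotation for the reader and can be removed.

```ini
[MyTask]
Url=http://www.example.com/
// Assumed page markup: <time datetime="2015-07-01">1 July 2015</time>
PageDateSelector=time
PageDateAttribute=datetime
MinPageDate=-30 days
IngestPagesWithNoDate=False
```

If the connector does not recognize the date format used in the attribute, DateFormats is the parameter to adjust. Its format syntax is not covered in this section, so no value is shown here.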