In some cases you might want to retrieve information from the Web only when it has been recently updated. For example, if you are crawling a news site, you might want to restrict the synchronize task to retrieve pages that were updated in the last 30 days.
To retrieve Web pages based on the date, the HPE Web Connector must be able to extract a date for each page. The connector checks the following sources (in order) and uses the first valid date that it finds:
The page URL (if DateInUrl is set to TRUE).
The page content (in elements identified by PageDateSelector).
The HTTP headers specified by PageDateHeader.
To retrieve pages based on the date
Stop the connector and open the configuration file.
Modify your fetch task by adding the relevant parameters from the following list, so that the connector can extract the date associated with a page:
| DateInUrl | A Boolean value that specifies whether the date associated with a Web page can be extracted from the page URL. When you set this parameter to TRUE, the connector attempts to extract the date from the URL. |
| PageDateSelector | A list of CSS2 selectors that identify elements in the page content that might contain the date associated with a Web page. |
| PageDateAttribute | A list of attributes (on the elements identified by PageDateSelector) that contain the date associated with the Web page. |
| PageDateHeader | A list of HTTP headers to retrieve the date from. The headers are checked in the order that you specify, and the first valid date that the connector finds is associated with the page. The connector attempts to extract a date from HTTP headers only if no date is found in the page URL (or DateInUrl=FALSE) and no date is found in the page content (or PageDateSelector is not set). |
| DateFormats | A comma-separated list of date formats to use when searching for a date. |
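The order in which the date sources are consulted (URL, then page content, then HTTP headers) can be illustrated with a minimal Python sketch. This is not the connector's implementation: the function first_valid_date and the three stand-in extractors are hypothetical names introduced only to show the "first valid date wins" rule.

```python
from datetime import datetime
from typing import Callable, Optional, Sequence

# Hypothetical type: an extractor returns a date, or None when its
# source yields no valid date.
Extractor = Callable[[], Optional[datetime]]


def first_valid_date(extractors: Sequence[Extractor]) -> Optional[datetime]:
    """Return the date from the first source that yields one, else None."""
    for extract in extractors:
        found = extract()
        if found is not None:
            return found
    return None


# Illustrative stand-ins for the three sources, in the documented order.
def from_url() -> Optional[datetime]:
    return None  # for example DateInUrl=FALSE, or no date in the URL


def from_content() -> Optional[datetime]:
    return datetime(2015, 7, 1)  # an element matched by PageDateSelector


def from_headers() -> Optional[datetime]:
    return datetime(2015, 6, 30)  # a header named in PageDateHeader


page_date = first_valid_date([from_url, from_content, from_headers])
print(page_date)  # 2015-07-01 00:00:00
```

Here the header date is never used, because the page content already supplied a valid date.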
Choose which pages to retrieve:
To configure a rolling limit, for example to retrieve pages that are not more than 30 days old when the synchronize task starts, set the following parameters:
| MaxPageAge | The maximum age that a page can reach and still be ingested. |
| MinPageAge | The minimum age that a page must reach before it is ingested. |
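As a rough illustration of a rolling limit, the following Python sketch checks a page's age against optional maximum and minimum ages. The function within_age_window is a hypothetical name, and treating both limits as inclusive is an assumption; the connector's exact boundary behaviour is not stated here.

```python
from datetime import datetime, timedelta
from typing import Optional


def within_age_window(
    page_date: datetime,
    now: datetime,
    max_age: Optional[timedelta] = None,
    min_age: Optional[timedelta] = None,
) -> bool:
    """Return True if the page's age lies within the configured window."""
    age = now - page_date
    # Assumption: a page exactly at a limit is still accepted.
    if max_age is not None and age > max_age:
        return False
    if min_age is not None and age < min_age:
        return False
    return True


# The age is measured from the moment the synchronize task starts.
task_start = datetime(2015, 7, 31)
thirty_days = timedelta(days=30)

recent = within_age_window(datetime(2015, 7, 15), task_start, max_age=thirty_days)
stale = within_age_window(datetime(2015, 6, 1), task_start, max_age=thirty_days)
print(recent, stale)  # True False
```

Because the window is relative to the task start time, the same page can pass the filter on one run and fail it on a later run.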
To configure a fixed limit, for example to retrieve pages that were updated after 01 July 2015, set the following parameters:
| MinPageDate | Filters the pages that are ingested by date. The connector only ingests pages that are newer than the specified date. |
| MaxPageDate | Filters the pages that are ingested by date. The connector only ingests pages that are older than the specified date. |
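A fixed limit differs from a rolling limit in that the bounds do not move with the task start time. The following Python sketch shows the idea; within_date_bounds is a hypothetical name, and the strict comparisons follow the wording "newer than" and "older than" above rather than any confirmed connector behaviour.

```python
from datetime import datetime
from typing import Optional


def within_date_bounds(
    page_date: datetime,
    min_date: Optional[datetime] = None,
    max_date: Optional[datetime] = None,
) -> bool:
    """Return True if page_date is newer than min_date and older than max_date."""
    # Strict comparisons: a page dated exactly on a bound is rejected.
    if min_date is not None and page_date <= min_date:
        return False
    if max_date is not None and page_date >= max_date:
        return False
    return True


cutoff = datetime(2015, 7, 1)
print(within_date_bounds(datetime(2015, 7, 2), min_date=cutoff))   # True
print(within_date_bounds(datetime(2015, 6, 30), min_date=cutoff))  # False
```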
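To exclude pages that have no date, set IngestPagesWithNoDate to False.
To stop the connector spidering pages that the date filter excludes, set SpiderDateFilteredPages to False.
The following example configuration retrieves pages that are less than 30 days old, based on a date contained in the URL or in the page content: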
[MyTask]
Url=http://www.autonomy.com
...
DateInUrl=true
PageDateSelector=p[id=date]
MaxPageAge=30 days
IngestPagesWithNoDate=False
SpiderDateFilteredPages=False
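To show how the age limit and the no-date setting in the example configuration above might interact, here is a small Python sketch. It models only the ingest decision; should_ingest is a hypothetical function, not part of the connector, and the inclusive age comparison is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_ingest(
    page_date: Optional[datetime],
    now: datetime,
    max_age: timedelta,
    ingest_no_date: bool,
) -> bool:
    """Decide whether a page is ingested under a maximum-age filter."""
    if page_date is None:
        # No date could be extracted from any source, so the
        # IngestPagesWithNoDate-style flag decides the outcome.
        return ingest_no_date
    return now - page_date <= max_age


now = datetime(2015, 7, 31)
limit = timedelta(days=30)  # mirrors MaxPageAge=30 days

print(should_ingest(datetime(2015, 7, 20), now, limit, ingest_no_date=False))  # True
print(should_ingest(None, now, limit, ingest_no_date=False))                   # False
```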