Incremental Synchronization

The Web Connector supports incremental synchronization. The first time the connector synchronizes with a web site, it ingests all of the content that is requested by the task configuration. On subsequent synchronizations, the connector ingests only information that is new or has changed.

The connector stores the date and time when it requests a page. The next time it requests the page, it uses this information to set the If-Modified-Since header in the request sent to the web server. If the web server supports the If-Modified-Since header and does not consider the page to have been modified, it returns an HTTP 304 response code, which tells the connector that the page has not changed. Otherwise, it returns the page.
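The request described above is a standard HTTP conditional GET. The following is a minimal Python sketch of that exchange, not the connector's own code: the function names conditional_headers and fetch_if_modified are hypothetical, and it assumes the stored request time is available as a datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.error
import urllib.request


def conditional_headers(last_requested: datetime) -> dict:
    """Build the If-Modified-Since header from the stored request time."""
    # HTTP dates are expressed in GMT, so convert to UTC before formatting.
    # A naive datetime is treated as local time by astimezone().
    stamp = format_datetime(last_requested.astimezone(timezone.utc), usegmt=True)
    return {"If-Modified-Since": stamp}


def fetch_if_modified(url: str, last_requested: datetime):
    """Return the page body, or None if the server answers 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(last_requested))
    try:
        with urllib.request.urlopen(request) as response:
            # The server does not support the header, or the page has changed.
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # The page has not changed since the stored request time.
            return None
        raise
```

A server that ignores If-Modified-Since returns the full page every time, which is why the content check described next is still needed.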

If a page is returned, the connector performs a check to determine whether the page content has changed. Only if the connector determines that the page has changed does it ingest the page. This check occurs after clipping (if you have configured clipping) so that changes to parts of the page that are clipped do not cause the page to be re-ingested.
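One way to picture a check that runs after clipping is to compare a fingerprint of the clipped page with the fingerprint stored at the previous synchronization. The sketch below is an illustration under stated assumptions: the documentation does not say how the connector compares content, so the SHA-256 hash is an assumption, and the clip function is a hypothetical stand-in that simply drops <nav> and <footer> elements.

```python
import hashlib
import re
from typing import Optional


def clip(html: str) -> str:
    """Hypothetical stand-in for clipping: drop <nav> and <footer> elements."""
    return re.sub(r"<(nav|footer)\b.*?</\1>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)


def fingerprint(html: str) -> str:
    """Hash the page as it looks after clipping (assumed comparison method)."""
    return hashlib.sha256(clip(html).encode("utf-8")).hexdigest()


def needs_ingest(html: str, stored: Optional[str]) -> bool:
    """True if the page is new or its clipped content differs from last time."""
    return fingerprint(html) != stored
```

Because the fingerprint is computed on the clipped page, a change that is confined to the navigation bar or footer produces the same fingerprint and does not trigger re-ingestion.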

TIP: Many web pages have dynamic content and can appear to change every time they are requested.

Remove Irrelevant Content

Sometimes a page changes, but the changes are in a part of the page that you consider irrelevant. You can configure the connector to discard irrelevant content so that changes in this content do not cause a page to be re-ingested. The Web Connector has several features that you can use to do this:

  • Clipping removes irrelevant content, such as headers, footers, navigation bars, and advertisements.
  • You can set the following parameters to remove parts of a page:
    • RemoveComments removes HTML comments.
    • RemoveScripts removes scripts.
    • RemoveNoframes and RemoveNoscripts remove <noframes> and <noscript> content.
  • You can set the parameter IngestAsPlainText so that the connector extracts text from a web page, adds it to the DRECONTENT field, and then ingests a metadata-only document (a document without an associated HTML file).
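To see why removing these parts helps, consider a page whose only changing content is a build comment and an inline script. The sketch below approximates the effect of the Remove* parameters with regular expressions. It is an illustration only: the parameter names come from the list above, but the patterns and the strip_parts function are assumptions, not the connector's implementation.

```python
import re

# Assumed approximations of what each parameter removes.
_PATTERNS = {
    "RemoveComments": r"<!--.*?-->",
    "RemoveScripts": r"<script\b.*?</script\s*>",
    "RemoveNoframes": r"<noframes\b.*?</noframes\s*>",
    "RemoveNoscripts": r"<noscript\b.*?</noscript\s*>",
}


def strip_parts(html: str, enabled: list) -> str:
    """Remove the parts of the page selected by the enabled parameter names."""
    for name in enabled:
        html = re.sub(_PATTERNS[name], "", html,
                      flags=re.DOTALL | re.IGNORECASE)
    return html


page = ("<p>Hi</p><!-- build 1234 -->"
        "<script>var t = Date.now();</script>"
        "<noscript>Enable JS</noscript>")
stable = strip_parts(page, ["RemoveComments", "RemoveScripts", "RemoveNoscripts"])
```

With all three parameters enabled, only the paragraph remains, so a new build number or script change no longer makes the page look different.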