Web Connector

23.2.1

Web Connector 23.2.1 resolves an issue where the connector handled cookies incorrectly: every cookie was treated as if its name and value were empty.

23.2.0

New Features

  • Web Connector can use the Mozilla readability library to clip pages. Clipping removes uninteresting parts of a page, such as navigation bars and advertisements, to prevent irrelevant information from being added to the IDOL index. Automatic clipping was available in previous versions of Web Connector, but the readability library produces better results in some cases. To clip pages using the readability library, set ClippingMode=READABILITY. This feature is not available on FIPS-compliant platforms.
  • When extracting metadata from a page, Web Connector can write the information into structured document fields. Earlier versions of Web Connector could not be configured to write this information into sub-fields.
  • The connector has a new configuration parameter, FormSubmissionType, which specifies whether the connector should expect page navigation to occur when a form is submitted. If the connector does not observe the specified behavior, the page is not processed and is retried during the next synchronize cycle.
  • To improve performance, the connector stores downloaded robot protocol data in a cache. The configuration parameters RobotProtocolEnableCache (default TRUE) and RobotProtocolCacheDirectory have been added so that you can configure this feature.
  • The configuration parameter CookieStoreDirectory has been added, so that you can choose where cookies are stored.
  • The connector includes an example configuration for retrieving data from Google News.
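
As a rough illustration of the new parameters listed above, a task configuration might combine them as follows. This is a sketch only: the section name [MyTask] is borrowed from the example in the Notes section, the directory paths are hypothetical, and placing all of these parameters in the task section is an assumption. Refer to the Web Connector Help for the section in which each parameter belongs and the values it accepts.

```ini
[MyTask]
...
// Clip pages using the Mozilla readability library
// (not available on FIPS-compliant platforms).
ClippingMode=READABILITY

// Cache downloaded robot protocol data (TRUE is the default).
// The directory below is a hypothetical path.
RobotProtocolEnableCache=TRUE
RobotProtocolCacheDirectory=./robots_cache

// Choose where cookies are stored (hypothetical path).
CookieStoreDirectory=./cookies
```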

Resolved Issues

  • When processing a site map with a large number of entries, Web Connector could queue ingest-delete commands for web pages that still existed.
  • UTF-8 multibyte whitespace was not handled correctly when the configuration parameter NormalizeWhitespace was TRUE.
  • The connector could incorrectly remove whitespace between some words when the configuration parameter NormalizeWhitespace was TRUE.
  • There was an issue with the Web Connector 12.13 zip package for Linux x86-64 (an incorrect version of libstdc++.so.6 was included). When installed from this package, the embedded web browser (WKOOP) did not start.

Notes

  • As a result of the improvements to clipping, you must update any task configurations that include clipping. The Clipped parameter has been removed and replaced with a new parameter, ClippingMode. For more information about clipping in Web Connector 23.2, please refer to the Web Connector Help.
  • As a result of the improvements to metadata extraction, you must update any task configurations that extract metadata.

    • In earlier versions of Web Connector, the MetadataSelector and MetadataFieldName parameters accepted multiple values. With Web Connector 23.2, each of these parameters accepts a single value. Use the new parameter MetadataFieldSections to specify the names of the sections that contain these parameters (one section for each field that you want to create in your IDOL documents).

      The configuration parameter MetadataSelectorExtractPlainText has been renamed to MetadataExtractPlainText.

      For example:

      Web Connector 12.13:

      [MyTask]
      ...
      MetadataSelector0=h1
      MetadataFieldName0=HeadingOne
      MetadataSelectorExtractPlainText=TRUE

      Web Connector 23.2:

      [MyTask]
      ...
      MetadataFieldSections0=ExtractH1

      [ExtractH1]
      MetadataSelector=h1
      MetadataFieldName=HeadingOne
      MetadataExtractPlainText=TRUE

    • The configuration parameters ChildMetadataSelector, ChildMetadataSelectorExtractPlainText, ChildMetadataFieldName, and ChildMetadataAttribute have been removed. You can now extract metadata for child documents using the same parameters that you would use for the main document. Use the new parameter ChildDocumentMetadataFieldSections to specify the names of sections that contain settings for metadata extraction from child documents.
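
      The child-document settings described above might look like the following sketch. It assumes that ChildDocumentMetadataFieldSections takes a numbered suffix in the same way as MetadataFieldSections0 in the earlier example. The section name ExtractChildHeading, the selector h2, and the field name ChildHeading are hypothetical and chosen only for illustration.

      ```ini
      [MyTask]
      ...
      // Assumed to follow the same numbering convention as MetadataFieldSections.
      ChildDocumentMetadataFieldSections0=ExtractChildHeading

      // Hypothetical section: uses the same parameters as for the main document.
      [ExtractChildHeading]
      MetadataSelector=h2
      MetadataFieldName=ChildHeading
      MetadataExtractPlainText=TRUE
      ```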

Supported Platforms

Web Connector 23.2.0 is supported on the following platforms.

Windows (x86-64)

  • Windows Server 2022
  • Windows Server 2019
  • Windows Server 2016
  • Windows Server 2012

Linux (x86-64)

The minimum supported versions of particular distributions are:

  • Red Hat Enterprise Linux (RHEL) 7
  • CentOS 7
  • SUSE Linux Enterprise Server (SLES) 12
  • Ubuntu 14.04
  • Debian 8

Documentation

The following documentation is available for Web Connector version 23.2.0.

  • Web Connector Help