Web Connector

24.3.0

There were no new features or resolved issues in this release.

24.2.0

New Features

  • The embedded web browser in Web Connector has been upgraded to Chromium 115 (except on FIPS-compliant platforms, which use Qt WebKit 5.1.1).
  • The connector can wait for a web page to reach a specified state before processing the page. You might want to do this if the website uses JavaScript to load page content and you need to make sure the content has finished loading. For example, if a "loading" icon is initially displayed, you could configure the connector to wait until it is removed. The new parameters include WaitForStateSections, WaitForStateUrlRegex, WaitForStateSelector, WaitForStateValidator, and WaitForStateTimeout. For more information about this feature, refer to the Web Connector documentation.
  • The configuration parameter FormInputLuaScript has been added, so that you can populate the fields of an HTML form programmatically through Lua.
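
As a rough illustration of the wait-for-state parameters listed above, the fragment below sketches one possible configuration. It assumes that WaitForStateSections names one or more configuration sections, that WaitForStateUrlRegex takes a regular expression that page URLs are matched against, and that WaitForStateSelector takes a CSS selector. The task name, section name, URL pattern, and selector are hypothetical. WaitForStateValidator and WaitForStateTimeout are left out because their accepted values are not described in these notes. Check the Web Connector documentation before relying on any of this.

```ini
// Hypothetical task; only the parameter names come from these release notes.
[MyTask]
WaitForStateSections=WaitForContent

// Assumed layout: one section per wait condition.
[WaitForContent]
// Assumed: apply the wait only to pages whose URL matches this pattern.
WaitForStateUrlRegex=.*example\.com/articles/.*
// Assumed: a CSS selector that identifies the element whose state matters.
WaitForStateSelector=div.article-body
```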

Resolved Issues

There were no resolved issues in this release.

24.1.0

New Features

  • The view action can be configured to return an image, thumbnail image, or PDF of a web page. This provides a preview of the page that reflects its appearance when it was last synchronized. (Normally, the view action returns the HTML source of the page at the time of the request.) This feature ensures that the preview returned by the view action matches the data in your IDOL index. It can also result in a better preview, because HTML returned from the view action might not render well if the page requires external resources such as stylesheets and scripts. For more information about this feature, refer to the documentation for the new configuration parameter ViewRenditionFormat.

Resolved Issues

  • The connector could terminate unexpectedly when UseSitemapFromRobots=TRUE and the robots.txt file pointed to a sitemap that was an index of other sitemaps.

23.4.0

New Features

  • The configuration parameter MaxLinkPercentagePerPage has been added, so that you can more easily prevent the connector from ingesting index and navigation pages. This parameter specifies the maximum percentage of a page's textual content that can consist of links. The connector does not ingest pages that exceed this percentage, but it still crawls them and follows the links that it finds.
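
A minimal sketch of how this parameter might be set. It assumes that the value is a plain number read as a percentage and that the parameter can be set in a task section. The task name and the value 60 are hypothetical.

```ini
// Hypothetical task section.
[MyTask]
// Assumed value format: do not ingest pages where more than 60% of the text is link text.
// Such pages are still crawled and their links are followed.
MaxLinkPercentagePerPage=60
```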

Resolved Issues

There were no resolved issues in this release.

23.3.0

New Features

  • The configuration parameters SitemapIndexUrlCantHaveRegex and SitemapIndexUrlMustHaveRegex have been added, so that you can choose the sitemaps to process when you configure the connector to process a sitemap index (which is a list of sitemaps).
  • The View action can return document metadata. To obtain the metadata, set the action parameter NoACI=FALSE. By default, the View action returns the binary content of the file.
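
The sitemap index filters described above might be configured along the following lines. This is only a sketch: it assumes that each regular expression is matched against the full URL of a sitemap listed in the index and that both parameters can be set in a task section. The task name and URL patterns are hypothetical.

```ini
// Hypothetical task section.
[MyTask]
// Assumed: process only the sitemaps whose URL matches this pattern.
SitemapIndexUrlMustHaveRegex=.*/sitemap-products-.*\.xml
// Assumed: skip any sitemap whose URL matches this pattern.
SitemapIndexUrlCantHaveRegex=.*/archive/.*
```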

Resolved Issues

  • Crash dumps for the embedded web browser (WKOOP) were written to disk even though this behavior was disabled. The connector is designed in such a way that WKOOP can terminate without adversely affecting the connector.
  • The connector did not handle cookies correctly (all cookies were treated as if their name and value were empty).

23.2.0

New Features

  • Web Connector can use the Mozilla readability library to clip pages. Clipping removes uninteresting parts of a page such as navigation bars and advertisements, to prevent irrelevant information being added to the IDOL index. Automatic clipping was available in previous versions of Web Connector but the readability library produces better results in some cases. To clip pages using the readability library, set ClippingMode=READABILITY. This feature is not available on FIPS-compliant platforms.
  • When extracting metadata from a page, Web Connector can write the information into structured document fields. Earlier versions of Web Connector could not be configured to write this information into sub-fields.
  • The connector has a new configuration parameter, FormSubmissionType, which specifies whether the connector should expect page navigation to occur when a form is submitted. If the connector does not observe the specified behavior, the page is not processed and is retried during the next synchronize cycle.
  • To improve performance, the connector stores downloaded robot protocol data in a cache. The configuration parameters RobotProtocolEnableCache (default TRUE) and RobotProtocolCacheDirectory have been added so that you can configure this feature.
  • The configuration parameter CookieStoreDirectory has been added, so that you can choose where cookies are stored.
  • The connector includes an example configuration for retrieving data from Google News.
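
Some of the parameters introduced in this release could be combined as in the sketch below. The ClippingMode value is the one given above. The task name and both directory paths are hypothetical, and placing all three parameters in a task section (rather than a global section) is an assumption.

```ini
// Hypothetical task section.
[MyTask]
// Clip pages with the readability library (not available on FIPS-compliant platforms).
ClippingMode=READABILITY
// Hypothetical location for the robot protocol cache.
RobotProtocolCacheDirectory=./cache/robots
// Hypothetical location for stored cookies.
CookieStoreDirectory=./cache/cookies
```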

Resolved Issues

  • When processing a sitemap with a large number of entries, Web Connector could queue ingest-delete commands for web pages that still existed.
  • UTF-8 multibyte whitespace was not handled correctly when the configuration parameter NormalizeWhitespace was TRUE.
  • The connector could incorrectly remove whitespace between some words when the configuration parameter NormalizeWhitespace was TRUE.
  • There was an issue with the Web Connector 12.13 zip package for Linux x86-64 (an incorrect version of libstdc++.so.6 was included). When installed from this package, the embedded web browser (WKOOP) did not start.

Notes

  • As a result of the improvements to clipping, you must update any task configurations that include clipping. The Clipped parameter has been removed and replaced with a new parameter, ClippingMode. For more information about clipping in Web Connector 23.2, please refer to the Web Connector Help.
  • As a result of the improvements to metadata extraction, you must update any task configurations that extract metadata.

    • In earlier versions of Web Connector, the MetadataSelector and MetadataFieldName parameters accepted multiple values. With Web Connector 23.2 these parameters accept a single value and you should use the new parameter MetadataFieldSections to specify the names of sections that contain these parameters (one section for each field that you want to create in your IDOL documents).

      The configuration parameter MetadataSelectorExtractPlainText has been renamed to MetadataExtractPlainText.

      For example, a configuration for Web Connector 12.13:

      [MyTask]
      ...
      MetadataSelector0=h1
      MetadataFieldName0=HeadingOne
      MetadataSelectorExtractPlainText=TRUE

      The equivalent configuration for Web Connector 23.2:

      [MyTask]
      ...
      MetadataFieldSections0=ExtractH1

      [ExtractH1]
      MetadataSelector=h1
      MetadataFieldName=HeadingOne
      MetadataExtractPlainText=TRUE

    • The configuration parameters ChildMetadataSelector, ChildMetadataSelectorExtractPlainText, ChildMetadataFieldName, and ChildMetadataAttribute have been removed. You can now extract metadata for child documents using the same parameters that you would use for the main document. Use the new parameter ChildDocumentMetadataFieldSections to specify the names of sections that contain settings for metadata extraction from child documents.
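
Metadata extraction for child documents might then look like the following sketch. It assumes that ChildDocumentMetadataFieldSections is numbered and names sections in the same way as MetadataFieldSections. The task name, section name, selector, and field name are hypothetical.

```ini
// Hypothetical task section.
[MyTask]
// Assumed to follow the same numbering convention as MetadataFieldSections0.
ChildDocumentMetadataFieldSections0=ExtractChildHeading

// Hypothetical section, using the same parameters as for the main document.
[ExtractChildHeading]
MetadataSelector=h2
MetadataFieldName=ChildHeading
MetadataExtractPlainText=TRUE
```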

Deprecated Features

The following features are deprecated and might be removed in a future release.

Category: Clipping
Deprecated Feature: The SMARTPRINT clipping mode has been deprecated. If you have set ClippingMode=SMARTPRINT, OpenText recommends choosing a different mode.
Deprecated Since: 23.3.0