HTML Processing with WKOOP

The WKOOPHtmlExtraction task processes an HTML file that is associated with a document. It extracts links and metadata and adds them to the document in a metadata field named HTML_PROCESSING. The task also appends a page to the document content that contains the plain text extracted from the HTML source, and sets the AUTN_NO_FILTER field, which prevents KeyView from processing the document.

This section describes how to configure HTML processing with WKOOP.

You can configure WKOOP HTML extraction as a pre-import task (Pre0 in the following example). The Pre0 parameter also specifies the name of the configuration section that contains the settings for the task. In the following example, the section is named HtmlProcessingSettings.

[ImportTasks]
Pre0=WKOOPHtmlExtraction:HtmlProcessingSettings

[HtmlProcessingSettings]
WKOOPPath=F:\IDOL\WebConnector\WKOOP.exe
ProxyHost=proxy.domain.com
ProxyPort=8080
SSLMethod=NEGOTIATE
ExtractLinks=TRUE
ResolveLinks=TRUE
Url=https://www.example.com/

The WKOOPPath parameter specifies the path to WKOOP. WKOOP is not included with CFS, so you must install the IDOL Web Connector and specify the path to the WKOOP executable file. You must install a version of WKOOP that is the same as, or later than, the version of CFS that you are using.

If you are running CFS on a machine that is behind a proxy server, set the ProxyHost and ProxyPort parameters to specify the proxy server to use to access the web. The SSLMethod parameter specifies the version of SSL or TLS to use when connecting to the web site, and is necessary to retrieve resources over HTTPS. Setting this parameter to NEGOTIATE uses the latest version that is supported by both CFS and the web server.

The ExtractLinks parameter accepts a Boolean value that specifies whether to extract links from HTML pages and add the links to the document metadata. When ResolveLinks=TRUE, the links are resolved so that indexed documents contain absolute URLs. The Url parameter specifies the source URL against which links are resolved. You do not need to specify the exact URL of the page being processed, as long as all URLs in the document being processed are relative to the web server.
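
The relationship between ResolveLinks and Url can be illustrated with ordinary URL resolution. The following Python sketch is not the WKOOP implementation. It assumes that link resolution behaves like standard relative-reference resolution (RFC 3986), as provided by urllib.parse.urljoin in the Python standard library. The LinkCollector class, the extract_and_resolve function, and the sample HTML are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_and_resolve(html, base_url):
    """Return the links found in html, resolved against base_url."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [urljoin(base_url, link) for link in collector.links]


# Hypothetical page content with three kinds of link.
SAMPLE_HTML = (
    '<a href="/products/index.html">Root-relative</a>'
    '<a href="images/logo.png">Page-relative</a>'
    '<a href="https://other.example.org/a">Absolute</a>'
)

# Base URL equivalent to Url=https://www.example.com/ in the example above.
resolved_from_root = extract_and_resolve(SAMPLE_HTML, "https://www.example.com/")

# Base URL set to the exact (hypothetical) page location instead.
resolved_from_page = extract_and_resolve(
    SAMPLE_HTML, "https://www.example.com/docs/page.html"
)

for root_url, page_url in zip(resolved_from_root, resolved_from_page):
    print(root_url, "|", page_url)
```

Under this assumption, the root-relative link (/products/index.html) and the absolute link resolve to the same URL with either base, whereas the page-relative link (images/logo.png) resolves differently depending on the base. This is consistent with the guidance that the exact page URL is unnecessary only when the links in the document are relative to the web server.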

For a full list of configuration parameters that you can use to configure WKOOP HTML extraction, refer to the Connector Framework Server Reference.