Process HTML
Connectors, including the IDOL Web Connector, can send documents that have associated HTML files to Connector Framework Server (CFS).
CFS can send the HTML files to KeyView, which discards the HTML markup and extracts the text contained in the file. However, HTML pages often contain irrelevant content such as invalid HTML, headers, sidebars, advertisements, and scripts. This content carries no useful information and can pollute the IDOL index, degrading performance. KeyView does not remove this irrelevant content, so CFS provides features to process HTML files.
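The following Python sketch illustrates why discarding markup alone is not enough. It is not KeyView and does not use any IDOL API; the `TextOnly` class, the `strip_markup` function, and the sample page are hypothetical names invented for this illustration. It uses only the standard `html.parser` module.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores tags (a naive markup stripper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep every non-empty run of text, wherever it appears on the page.
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_markup(html):
    """Return the text of an HTML string with all tags removed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: a sidebar, an advertisement, and the real article.
PAGE = """
<html><body>
  <div id="sidebar">Home | Products | Contact</div>
  <div id="ad">Buy one, get one free!</div>
  <div id="article"><p>Quarterly results exceeded forecasts.</p></div>
</body></html>
"""

text = strip_markup(PAGE)
print(text)
```

The markup is gone, but the sidebar and advertisement text remain alongside the article text, which is the kind of content that the features below are intended to remove.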
- HTML processing with WKOOP. CFS can use an embedded browser (WKOOP) to process HTML in a similar way to the IDOL Web Connector. WKOOP has several advantages over other methods of processing HTML:
  - The browser allows scripts to run before the page is processed, so CFS can extract content and links that are added by JavaScript.
  - Links are resolved before a document is ingested, so that indexed documents contain absolute URLs.
  - You can remove unwanted content using the automatic clipping algorithm, or by selecting parts of the page with CSS selectors.
  - You can extract metadata or divide pages into multiple documents using CSS selectors rather than regular expressions.
  NOTE: To use WKOOP, you must also install the IDOL Web Connector, because WKOOP is not provided with CFS. You must install a version of WKOOP that is the same as, or later than, the version of CFS that you are using.
- HTMLExtraction. HTML extraction keeps the useful information from the page and discards the irrelevant content. It automatically determines which content is relevant, so there are no configuration parameters for customizing this operation. If HTML extraction does not produce good results for your use case, consider using the clipping features provided by WKOOP instead.
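Two of the ideas above, keeping only a selected part of the page and resolving links to absolute URLs, can be illustrated with a small Python sketch. This is a conceptual stand-in, not WKOOP and not CFS configuration: it selects a single element by its `id` attribute rather than by a full CSS selector, it does not run JavaScript, and the `ClipAndResolve` class, the sample page, and the `www.example.com` base URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ClipAndResolve(HTMLParser):
    """Keeps text inside one element (chosen by id) and resolves its links."""

    # Void elements have no end tag, so they must not change the depth count.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, base_url, keep_id):
        super().__init__()
        self.base_url = base_url
        self.keep_id = keep_id
        self.depth = 0   # greater than 0 while inside the kept element
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0 and attrs.get("id") == self.keep_id:
            self.depth = 1
        elif self.depth > 0 and tag not in self.VOID:
            self.depth += 1
        if self.depth > 0 and tag == "a" and attrs.get("href"):
            # Resolve the (possibly relative) link against the page URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in self.VOID:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.text.append(data.strip())


# Hypothetical page with navigation, an article, and an advertisement.
PAGE = """
<div id="nav"><a href="/home">Home</a></div>
<div id="article">
  <p>Results are in the <a href="reports/q3.html">Q3 report</a>.</p>
</div>
<div id="ad"><a href="http://ads.example.com/x">Buy now</a></div>
"""

clipper = ClipAndResolve("http://www.example.com/news/index.html", "article")
clipper.feed(PAGE)
clipper.close()
print(clipper.links)
```

Only the link inside the selected element is kept, and it is stored as an absolute URL, so a document built from this output would not depend on the location of the original page.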