HTML Extraction

HTML pages often contain irrelevant content such as invalid HTML, headers, sidebars, advertisements, and scripts. CFS can extract the useful information from the page and discard the irrelevant content.

To extract the useful information from an HTML page, use the HtmlExtraction import task. This task works only on HTML files and ignores other file types.

CFS reads the HTML document and discards data such as invalid HTML, headers, sidebars, advertisements, and scripts. From the remaining content, CFS extracts blocks of text that contain a large number of stopwords and a low proportion of links. Such text is likely to be the most important content. Because CFS determines automatically which content is relevant, there are no configuration parameters for customizing this task.
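
The following Python sketch illustrates the general idea of selecting text blocks by stopword count and link proportion. It is not the CFS implementation: the stopword list, the thresholds (MIN_STOPWORDS, MAX_LINK_DENSITY), and the function names are all assumptions made for illustration, and CFS does not expose equivalent settings.

```python
import re

# Hypothetical stopword list and thresholds, chosen only for illustration.
# The values that CFS uses internally are not documented in this section.
STOPWORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "that", "it",
    "for", "on", "with", "as", "was", "are", "this", "be",
}
MIN_STOPWORDS = 3        # a block needs at least this many stopwords
MAX_LINK_DENSITY = 0.3   # at most this fraction of words may be link text


def link_density(block_text, link_text):
    """Return the fraction of the block's words that appear inside links."""
    total = len(block_text.split())
    if total == 0:
        return 1.0  # treat an empty block as all links, so it is discarded
    return len(link_text.split()) / total


def is_content(block_text, link_text=""):
    """Return True if the block looks like main content under this heuristic."""
    words = re.findall(r"[a-z']+", block_text.lower())
    stopword_count = sum(1 for word in words if word in STOPWORDS)
    return (
        stopword_count >= MIN_STOPWORDS
        and link_density(block_text, link_text) <= MAX_LINK_DENSITY
    )


# Each block is (all text in the block, text that sits inside links).
blocks = [
    ("Home Products Support Contact", "Home Products Support Contact"),
    ("The task reads the page and keeps the text that is likely "
     "to be the main content of the document.", ""),
]
kept = [text for text, links in blocks if is_content(text, links)]
```

In this sketch, the navigation block is rejected because it contains no stopwords and consists entirely of link text, while the sentence block is kept.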

Micro Focus recommends that you configure the HtmlExtraction task as a Pre import task. For example:

[ImportTasks]
Pre0=HtmlExtraction

After extracting the useful information, the HtmlExtraction task sets the document field AUTN_NO_FILTER, so that KeyView does not process the HTML file.