HTML Extraction
HTML pages often contain irrelevant content such as invalid HTML, headers, sidebars, advertisements, and scripts. CFS can extract the useful information from the page and discard the irrelevant content.
To extract the useful information from an HTML page, use the HtmlExtraction import task. This task works only on HTML files and ignores other file types.
CFS reads the HTML document and discards data such as invalid HTML, headers, sidebars, advertisements, and scripts. From the remaining content, CFS extracts blocks of text that contain a large number of stopwords and a low proportion of links. This text is likely to be the most important content. Because CFS automatically determines which content is relevant, there are no configuration parameters for customizing this task.
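The stopword and link heuristic described above can be illustrated with a minimal Python sketch. This is not the CFS implementation, which is not documented here. The stopword list, the tag sets, the thresholds (MIN_STOPWORDS, MAX_LINK_DENSITY), and the names BlockCollector and extract_main_text are all assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser

# All values below are illustrative assumptions, not CFS settings.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that",
             "this", "for", "on", "with", "as", "are", "was", "be", "by", "or"}
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "article", "section"}
MIN_STOPWORDS = 5        # a block needs at least this many stopwords
MAX_LINK_DENSITY = 0.3   # at most this fraction of words may sit inside <a>
WORD_RE = re.compile(r"[A-Za-z']+")


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_word_count)
        self._chunks = []      # text pieces of the block being built
        self._link_words = 0   # words of the current block that are link text
        self._skip_depth = 0   # > 0 while inside a discarded element
        self._link_depth = 0   # > 0 while inside an <a> element

    def flush(self):
        # Close the current block, if it holds any text.
        if self._chunks:
            self.blocks.append((" ".join(self._chunks), self._link_words))
        self._chunks = []
        self._link_words = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.flush()
            self._skip_depth += 1
        elif tag == "a":
            self._link_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth or not text:
            return
        self._chunks.append(text)
        if self._link_depth:
            self._link_words += len(WORD_RE.findall(text))


def extract_main_text(html):
    """Return the blocks that are rich in stopwords and poor in links."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_words in parser.blocks:
        words = WORD_RE.findall(text)
        if not words:
            continue
        stopword_count = sum(1 for w in words if w.lower() in STOPWORDS)
        link_density = link_words / len(words)
        if stopword_count >= MIN_STOPWORDS and link_density <= MAX_LINK_DENSITY:
            kept.append(text)
    return kept


if __name__ == "__main__":
    page = ("<html><head><script>var the = 'a and of to in';</script></head>"
            "<body><nav><a href='/'>Home</a> <a href='/x'>About</a></nav>"
            "<p>The cat sat on the mat and it was happy to be in the sun "
            "with a friend.</p>"
            "<div><a href='/a'>The list of the links to the pages</a></div>"
            "</body></html>")
    for block in extract_main_text(page):
        print(block)
```

In this sketch, the navigation bar and the script are dropped outright, the paragraph is kept because it is ordinary prose with many stopwords, and the final block is rejected because all of its words are link text even though it contains enough stopwords.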
OpenText recommends that you configure the HtmlExtraction task as a Pre import task. For example:
[ImportTasks]
Pre0=HtmlExtraction
After extracting the useful information, the HtmlExtraction task sets the document field AUTN_NO_FILTER, so that KeyView does not process the HTML file.