HTML pages often contain irrelevant content such as invalid HTML, headers, sidebars, advertisements, and scripts. HPE CFS can extract the useful information from the page and discard the irrelevant content.
To extract the useful information from an HTML page, use the HtmlExtraction import task. This task works only on HTML files and ignores other file types.
HPE CFS reads the HTML document and discards data such as invalid HTML, headers, sidebars, advertisements, and scripts. In the remaining content, HPE CFS then extracts blocks of text that contain a large number of stopwords and a low proportion of links. Stopwords are common in running prose but rare in menus and link lists, so this text is likely to be the most important content. Because HPE CFS automatically determines which content is relevant, there are no configuration parameters for customizing this task.
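The stopword and link heuristic described above can be illustrated with a short Python sketch. This is not the HPE CFS implementation, which is not documented here. The stopword list, the thresholds (MIN_STOPWORDS, MAX_LINK_RATIO), the choice of block-level tags, and the names BlockCollector and extract_main_text are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Illustrative values only; these are assumptions, not HPE CFS settings.
STOPWORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for", "from",
             "in", "is", "it", "of", "on", "or", "that", "the", "this", "to",
             "was", "with"}
MIN_STOPWORDS = 5      # keep blocks with at least this many stopwords
MAX_LINK_RATIO = 0.3   # keep blocks where at most 30% of the text is link text
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "section", "article",
              "nav", "aside", "header", "footer"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the characters inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished (text, linked_chars) pairs
        self._parts = []    # text fragments of the block being built
        self._linked = 0    # characters of the current block inside <a>
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block and normalize its whitespace.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append((text, self._linked))
        self._parts, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style content
        self._parts.append(data)
        if self._in_link:
            self._linked += len(data.strip())


def extract_main_text(html):
    """Return the blocks that have many stopwords and a low share of link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = []
    for text, linked in collector.blocks:
        words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
        stopword_count = sum(w in STOPWORDS for w in words)
        link_ratio = linked / len(text)
        if stopword_count >= MIN_STOPWORDS and link_ratio <= MAX_LINK_RATIO:
            kept.append(text)
    return kept
```

In this sketch, a navigation bar that consists mostly of link text is rejected on both criteria, while a paragraph of ordinary prose passes. A production extractor would need a fuller stopword list and more careful handling of nested and malformed markup.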
HPE recommends that you configure the HtmlExtraction task as a Pre import task. For example:
[ImportTasks]
Pre0=HtmlExtraction
After extracting the useful information, the HtmlExtraction task sets the document field AUTN_NO_FILTER, so that the HTML file is not processed by KeyView.