Remove Irrelevant Content
The content on most web pages includes headers, footers, navigation bars, and advertisements. Unless these are removed from pages, the text from these items could reduce the quality of the documents indexed into IDOL Server and reduce the effectiveness of operations such as categorization.
You can use a feature called "clipping" to remove irrelevant content.
Clip Pages Automatically
Connector Framework Server can clip pages automatically, using one of the following algorithms to decide which parts of the page to keep and which to discard.
- To use the Mozilla readability library, set ClippingMode to READABILITY.
- To use the SmartPrint algorithm, set ClippingMode to SMARTPRINT. SmartPrint works best with common page designs, such as pages where the content is in the center and there are navigation panels to the top and left, with extra content to the right. The SmartPrint algorithm evaluates each section of the page and decides whether to clip it based on several factors, including:
  - The position of the section on the page (central content is preferred).
  - The ratio of links to words (a smaller proportion of links is preferred).
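As a minimal sketch, automatic clipping could be enabled with a single parameter. The section name [MyTask] is the placeholder used by the CSS example in this topic, and the // comment lines are only annotations; check where ClippingMode belongs in your own configuration before copying this.

```ini
[MyTask]
// Assumption: ClippingMode is set in the same section as in the CSS clipping example.
// Use READABILITY for the Mozilla readability library,
// or SMARTPRINT for the SmartPrint algorithm.
ClippingMode=SMARTPRINT
```

Switching the value to READABILITY selects the other algorithm. Comparing the clipped output of both modes on a few representative pages is a reasonable way to choose between them.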
Clip Pages using CSS Selectors
The automatic clipping algorithms have been designed to work with many different pages, but this means that automatic clipping might not give the best results for every page. For this reason, you can use CSS selectors to choose which parts of the page to keep and which to discard. To clip pages with CSS selectors, set ClippingMode=CSSCLIPPING, and then set one or both of the parameters ClipPageUsingCssSelect and ClipPageUsingCssUnselect.
ClipPageUsingCssSelect | A CSS selector to specify the parts of a page to keep when the page is clipped. CFS also keeps all descendants of these elements.
ClipPageUsingCssUnselect | A CSS selector to specify the parts of a page to remove when the page is clipped. CFS also removes all descendants of these elements.
Connector Framework Server supports standard CSS selectors. To construct the selectors, view the source HTML of the pages that you need to clip. CSS allows you to select elements based on the structure of the page. For example, you can select elements of a certain type that are descendants of another element. Also, the designer of the page might have added classes to the relevant elements in order to style them, and you can use these same classes to clip the page.
The following example shows a simple page:
<html>
  <head>
  </head>
  <body>
    <nav>
      <!-- navigation and links -->
    </nav>
    <div class="maincontent">
      <p>Some content</p>
    </div>
    <div class="footer">
      <!-- footer -->
    </div>
  </body>
</html>
To select the main content but exclude the navigation element and the footer, you could use the following configuration:
[MyTask]
...
ClippingMode=CSSCLIPPING
ClipPageUsingCssSelect=div.maincontent
ClipPageUsingCssUnselect=nav,div.footer
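Because either parameter can be set on its own, a variation that only discards elements might look like the following sketch. It assumes that when no ClipPageUsingCssSelect value is configured the rest of the page is kept; that behavior is not stated in this topic, so verify it against the clipped output.

```ini
[MyTask]
// Other task settings omitted.
ClippingMode=CSSCLIPPING
// Assumption: with no ClipPageUsingCssSelect, everything except the
// elements matched below (and their descendants) is kept.
ClipPageUsingCssUnselect=nav,div.footer
```

This form can be convenient when the main content has no distinguishing class but the unwanted regions, such as the nav element and div.footer in the sample page, are easy to target.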
Remove Scripts and Hidden Content
You can also remove scripts and hidden content from the HTML page:
- Remove all scripts from the HTML page by setting RemoveScripts=TRUE.
- Remove "noframes" content by setting RemoveNoframes=TRUE. When web developers use frames, they might include content in a <noframes></noframes> element for web browsers that do not support frames. This content might duplicate content elsewhere in the HTML page or simply contain a message that the browser does not support frames.
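The two settings above could be combined with clipping in one task. This is a sketch only: it assumes that RemoveScripts and RemoveNoframes are set in the same section as ClippingMode, and [MyTask] is the placeholder name from the earlier example.

```ini
[MyTask]
// Other task settings omitted.
ClippingMode=READABILITY
// Assumption: these parameters belong in the same section as ClippingMode.
RemoveScripts=TRUE
RemoveNoframes=TRUE
```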