Remove Irrelevant Content

Most web pages include headers, footers, navigation bars, and advertisements. Unless you remove these items, their text can reduce the quality of the documents indexed into IDOL Server and make operations such as categorization less effective.

You can use a feature called "clipping" to remove irrelevant content.

Clip Pages Automatically

Connector Framework Server can clip pages automatically, using one of the following algorithms to decide which parts of the page to keep and which to discard.

  • To use the Mozilla readability library, set ClippingMode to READABILITY.
  • To use the SmartPrint algorithm, set ClippingMode to SMARTPRINT. SmartPrint works best with common page designs, such as pages where the content is in the center and there are navigation panels to the top and left, with extra content to the right.

    The SmartPrint algorithm evaluates each section of the page and decides whether to clip it based on several factors, including:

    • The position of the section on the page (central content is preferred).
    • The ratio of links to words (a smaller proportion of links is preferred).
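
As an illustration, a task that clips pages automatically with SmartPrint might be configured as in the following sketch. The section name MyTask is a placeholder; substitute the name of your own task.

[MyTask]
...
ClippingMode=SMARTPRINT

To try the Mozilla readability library instead, set ClippingMode=READABILITY in the same place.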

Clip Pages Using CSS Selectors

The automatic clipping algorithms are designed to work with many different pages, so they might not give the best results for every page. For pages where automatic clipping is not sufficient, you can use CSS selectors to choose which parts of the page to keep and which to discard. To clip pages with CSS selectors, set ClippingMode=CSSCLIPPING, and then set one or both of the parameters ClipPageUsingCssSelect and ClipPageUsingCssUnselect.

ClipPageUsingCssSelect

A CSS selector to specify the parts of a page to keep when the page is clipped. CFS also keeps all descendants of these elements.

ClipPageUsingCssUnselect

A CSS selector to specify the parts of a page to remove when the page is clipped. CFS also removes all descendants of these elements.

The ClipPageUsingCssSelect parameter is applied before ClipPageUsingCssUnselect, so you can use ClipPageUsingCssUnselect to remove unwanted descendants of the elements identified by ClipPageUsingCssSelect.

Connector Framework Server supports standard CSS selectors. To construct the selectors, view the source HTML of the pages that you need to clip. CSS allows you to select elements based on the structure of the page. For example, you can select elements of a certain type that are descendants of another element. The designer of the page might also have added classes to the relevant elements in order to style them, and you can use these same classes to clip the page.

The following example shows a simple page:

<html>
<head>
</head>
<body>
   <nav>
     <!-- navigation and links -->
   </nav>
   <div class="maincontent">
     <p>Some content</p>
   </div>
   <div class="footer">
     <!-- footer -->
   </div>
</body>
</html>

To select the main content but exclude the navigation element and the footer, you could use the following configuration:

[MyTask]
...
ClippingMode=CSSCLIPPING
ClipPageUsingCssSelect=div.maincontent
ClipPageUsingCssUnselect=nav,div.footer
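
In the page above, suppose the maincontent div also contained an advertisement, for example in a <div class="advert"> element (a hypothetical class name, not part of the example page). Because ClipPageUsingCssSelect is applied first, a configuration along the following lines should keep the main content but discard the advertisement inside it:

[MyTask]
...
ClippingMode=CSSCLIPPING
ClipPageUsingCssSelect=div.maincontent
ClipPageUsingCssUnselect=div.maincontent div.advert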

Remove Scripts and Hidden Content

You can also remove scripts and hidden content from the HTML page:

  • Remove all scripts from the HTML page by setting RemoveScripts=TRUE.
  • Remove "noframes" content by setting RemoveNoframes=TRUE. When web developers use frames they might include content in a <noframes></noframes> element, for web browsers that do not support frames. This content might duplicate content elsewhere in the HTML page or simply contain a message that the browser does not support frames.
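
For example, to remove both scripts and "noframes" content, the two parameters could be set together. This sketch assumes that they belong in the same task section as the clipping parameters; MyTask is a placeholder name.

[MyTask]
...
RemoveScripts=TRUE
RemoveNoframes=TRUE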