Clip Pages

The content on most web pages includes headers, footers, navigation bars, and advertisements. Unless these are removed from pages, the text from these items could reduce the quality of the documents indexed into IDOL Server and reduce the effectiveness of operations such as categorization.

You can configure HPE Web Connector to remove irrelevant content from pages before they are ingested.

Clip Pages Automatically

HPE Web Connector can clip pages automatically, using an algorithm to decide which parts of the page to keep and which to discard. To clip pages automatically, add Clipped=TRUE to your task configuration.
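
As a minimal sketch, the setting might appear in a task section as follows. The section name MyTask is a placeholder that matches the example later in this topic, and the ellipsis stands for the task's other settings, which are not shown here.

```
[MyTask]
...
Clipped=TRUE
```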

Clip Pages using CSS Selectors

The automatic clipping algorithm is designed to work with many different pages, so it might not give the best results for every page. For this reason, you can use CSS selectors to choose which parts of the page to keep and which to discard. To clip pages with CSS selectors, add Clipped=TRUE to your task configuration, and then set the parameters ClipPageUsingCssSelect and ClipPageUsingCssUnselect.

ClipPageUsingCssSelect: A comma-separated list of CSS2 selectors that specify parts of the page to keep. The connector also keeps all descendants of these elements.
ClipPageUsingCssUnselect: A comma-separated list of CSS2 selectors that specify parts of the page to remove. The connector also removes all descendants of these elements. The ClipPageUsingCssSelect parameter is applied before ClipPageUsingCssUnselect, so you can use ClipPageUsingCssUnselect to remove unwanted descendants of elements identified by ClipPageUsingCssSelect.

HPE Web Connector supports standard CSS2 selectors. To construct the selectors, view the source HTML of the pages that you need to clip. CSS allows you to select elements based on the structure of the page. For example, you can select elements of a certain type that are descendants of another element. Also, the designer of the page might have added classes to the relevant elements in order to style them, and you can use these same classes to clip the page.
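
As an illustration of both approaches, the following sketch assumes a hypothetical page whose article text sits in p elements inside a div with the class article. The class name is invented for this example; substitute the names that you find in the source HTML of your own pages. The selector combines a class that the page designer added for styling (div.article) with the CSS2 descendant combinator (a space) to keep only the paragraphs inside that element. The comment line assumes the // comment style used in connector configuration files.

```
[MyTask]
...
Clipped=TRUE
// "article" is a hypothetical class name used for illustration only
ClipPageUsingCssSelect=div.article p
```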

The following example shows a simple page:

<html>
<head>
</head>
<body>
   <nav>
     <!-- navigation and links -->
   </nav>
   <div class="maincontent">
     <p>Some content</p>
   </div>
   <div class="footer">
     <!-- footer -->
   </div>
</body>
</html>

To select the main content but exclude the navigation element and the footer, you could use the following configuration:

[MyTask]
...
Clipped=TRUE
ClipPageUsingCssSelect=div.maincontent
ClipPageUsingCssUnselect=nav,div.footer
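
Because ClipPageUsingCssSelect is applied first, ClipPageUsingCssUnselect can also trim content nested inside an element that is being kept. As a hypothetical variation on the page above, suppose the maincontent div also contained an advertisement block, <div class="advert">. The class name advert is an assumption made for illustration and does not appear in the sample page. A configuration along these lines would keep the main content but drop the nested advertisement:

```
[MyTask]
...
Clipped=TRUE
ClipPageUsingCssSelect=div.maincontent
// div.advert is a hypothetical element nested inside div.maincontent
ClipPageUsingCssUnselect=nav,div.footer,div.advert
```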

Tip: HPE Web Connector includes an example tool that can help you find the CSS selectors you need to clip web pages. For more information about this utility, see Find Selectors using the CSS Selector Builder Tool.

