Remove Irrelevant Content
To remove irrelevant content from HTML pages using the automatic clipping algorithm, add the parameter Clipped=TRUE to your task configuration. CFS then decides which parts of the page to keep and which to discard.
The automatic clipping algorithm is designed to work with many different pages, so it might not give the best results for every page. Alternatively, you can use CSS selectors to choose which parts of the page to keep and which to discard. To clip pages with CSS selectors, add Clipped=TRUE to your task configuration, then set ClipPageUsingCssSelect to specify the parts of the page to keep and ClipPageUsingCssUnselect to specify the parts of the page to remove. These parameters accept standard CSS2 selectors.
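The following fragment is a minimal sketch of clipping with CSS selectors. The section name [HtmlExtractionSettings] and both selector values are assumptions made for illustration. Use the section that your task configuration already refers to, and selectors that match the markup of the pages you process.

```
// Hypothetical section name: use the section your task is configured to read.
[HtmlExtractionSettings]
Clipped=TRUE
// Hypothetical selector: keep only the main article container.
ClipPageUsingCssSelect=div.article-body
// Hypothetical selector: discard advertising blocks.
ClipPageUsingCssUnselect=div.advert
```

With only Clipped=TRUE set and neither selector parameter present, clipping is left to the automatic algorithm described above.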
You can also remove scripts and hidden content from the HTML page:
- Remove all scripts from the HTML page by setting RemoveScripts=TRUE.
- Remove "noframes" content by setting RemoveNoframes=TRUE. When web developers use frames, they might include content in a <noframes></noframes> element for web browsers that do not support frames. This content might duplicate content elsewhere in the HTML page, or simply contain a message that the browser does not support frames.
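As an illustration, the two removal parameters might be set together as follows. The section name [HtmlExtractionSettings] is a hypothetical placeholder, not a name taken from this guide, so substitute the section your task configuration uses.

```
// Hypothetical section name: use the section your task is configured to read.
[HtmlExtractionSettings]
// Strip all scripts from the page.
RemoveScripts=TRUE
// Strip content inside <noframes> elements.
RemoveNoframes=TRUE
```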