ContentFromHTML

Connectors, including the IDOL Web Connector, can send documents for ingestion that have associated HTML files.

You could send the HTML files to the KeyViewFilterDocument processor, which discards the HTML markup and extracts the text that the files contain. However, HTML pages often contain irrelevant content such as invalid HTML, headers, sidebars, advertisements, and scripts. This content does not contain any useful information and could pollute the IDOL index and degrade performance. KeyView does not remove it.

The ContentFromHTML processor uses an embedded browser to process HTML in a similar way to the IDOL Web Connector. There are many reasons to use this processor over other methods of processing HTML:

  • The browser allows scripts to run before the page is processed, so the processor can extract content and links that are added by JavaScript.
  • Links are resolved before a document is ingested, so that indexed documents contain absolute URLs.
  • You can remove unwanted content using the automatic clipping algorithm, or by selecting parts of the page with CSS selectors.
  • You can extract metadata or divide pages into multiple documents using CSS selectors rather than regular expressions.
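
As a rough illustration of what resolving links into absolute URLs involves, the following Python sketch resolves a few links against two different base URLs. It uses the standard library function urljoin, not the processor's embedded browser, and the example.com URLs are invented for the example. It also assumes that "relative to the web server" (see the Url property below) means links that begin with "/".

```python
from urllib.parse import urljoin

# Hypothetical base URLs: the exact page, and only the site root.
EXACT_PAGE = "https://example.com/news/2024/story.html"
SITE_ROOT = "https://example.com/"

# Hypothetical links as they might appear in the HTML.
LINKS = [
    "https://cdn.example.net/lib.js",  # already absolute
    "/images/logo.png",                # relative to the web server root
    "related.html",                    # relative to the current page
]


def resolve_all(base, links):
    """Resolve each link against the given base URL."""
    return [urljoin(base, link) for link in links]


from_page = resolve_all(EXACT_PAGE, LINKS)
from_root = resolve_all(SITE_ROOT, LINKS)

# Report which links resolve identically for both base URLs.
for link, a, b in zip(LINKS, from_page, from_root):
    status = "same" if a == b else "differs"
    print(f"{link!r}: {status}")
```

In this sketch only the page-relative link depends on the exact base URL. The absolute link and the server-relative link resolve to the same result whichever base is used, which is consistent with the note that the Url property does not always need to name the exact page.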

Properties

Name Default Value Description
IDOL License Service   An IdolLicenseServiceImpl that provides a way to communicate with an IDOL License Server.

Document Registry Service   A DocumentRegistryServiceImpl controller service that manages and updates a document registry database. This ensures that documents are indexed in the correct order.
WKOOP Path   The path to the embedded browser (WKOOP) executable file.
Url data:text/html,

The source URL of the HTML content.

Specify a URL if you want to resolve links into absolute URLs, or if external resources are required to process the page, for example if external JavaScript files must run before the page is processed. You do not need to specify the exact URL of the page being processed, as long as all URLs in the document are either absolute or relative to the web server.

You can extract the value from a FlowFile attribute using the NiFi Expression Language, for example ${idol.reference}.

Clipped false

Specifies whether to clip web pages. Clipping removes uninteresting parts of a page such as advertisements. To clip pages, set this property to true.

To specify the parts of pages to keep and remove, set the properties Clip Page Using CSS: Select and Clip Page Using CSS: Unselect. If you do not set these properties, the processor uses an algorithm to decide which parts of the page to keep.

Clip Page Using CSS: Select   A comma-separated list of CSS selectors to specify the parts of a page to keep when the page is clipped. The processor also keeps all descendants of these elements.
Clip Page Using CSS: Unselect  

A comma-separated list of CSS selectors to specify the parts of a page to remove when the page is clipped. The processor also removes all descendants of these elements.

The Clip Page Using CSS: Select property is applied before Clip Page Using CSS: Unselect, so you can use Clip Page Using CSS: Unselect to remove unwanted descendants of the elements identified by Clip Page Using CSS: Select.

Clip CSS select on failure SelectAll Specifies what to do when the CSS selector specified by Clip Page Using CSS: Select does not match any elements on the page. "Fail" means that processing fails and the page is not ingested. "SelectAll" means that the page is not clipped and all of the page content is ingested.
Clip CSS unselect on failure SelectNone Specifies what to do when the CSS selector specified by Clip Page Using CSS: Unselect does not match any elements on the page. "Fail" means that processing fails and the page is not ingested. "SelectNone" means that clipping removes any element not selected by Clip Page Using CSS: Select.
Extract Links true Specifies whether to extract links from pages and add the links to the document metadata.
Extract HTML Meta true Specifies whether to extract information from the meta tags in HTML documents.
Page timeout 120s The maximum amount of time to spend processing a page. Specify a time duration, for example "15 seconds".
WKOOP parameters   The remaining configuration parameters are passed to WKOOP for HTML processing. For more information about these parameters, right-click the processor and click View Usage, or refer to the documentation for the IDOL Web Connector.
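
The way the Clip Page Using CSS: Select and Clip Page Using CSS: Unselect properties interact with the two on-failure properties can be sketched in Python as follows. This is only a model of the behavior described above, not the processor's implementation. The functions matches, clip, and text_of are hypothetical, the matcher understands only bare tag names and ".class" selectors, and the sample page is invented. The sketch also assumes that SelectAll leaves the page completely unclipped, and it checks Unselect only against the kept content rather than the whole page.

```python
import xml.etree.ElementTree as ET


def matches(elem, selector):
    """Toy matcher (assumption): supports only 'tag' and '.class' selectors."""
    if selector.startswith("."):
        return selector[1:] in elem.get("class", "").split()
    return elem.tag == selector


def matches_any(elem, selectors):
    return any(matches(elem, s) for s in selectors)


def clip(root, select, unselect,
         select_on_failure="SelectAll", unselect_on_failure="SelectNone"):
    """Return the subtrees kept after Select, then Unselect, are applied."""
    # Step 1 (Select): keep matching elements together with their descendants.
    kept = []

    def collect(elem):
        if matches_any(elem, select):
            kept.append(elem)
            return
        for child in elem:
            collect(child)

    if select:
        collect(root)
        if not kept:
            if select_on_failure == "Fail":
                raise ValueError("Select matched nothing; page is not ingested")
            return [root]  # SelectAll: assumed to mean no clipping at all
    else:
        kept = [root]

    # Step 2 (Unselect): remove matching elements from what was kept.
    removed = 0

    def prune(elem):
        nonlocal removed
        for child in list(elem):
            if matches_any(child, unselect):
                elem.remove(child)  # descendants are removed with it
                removed += 1
            else:
                prune(child)

    survivors = []
    for elem in kept:
        if matches_any(elem, unselect):
            removed += 1
            continue
        prune(elem)
        survivors.append(elem)

    if unselect and removed == 0 and unselect_on_failure == "Fail":
        raise ValueError("Unselect matched nothing; page is not ingested")
    return survivors  # SelectNone: nothing further is removed


def text_of(elem):
    """Collect the visible text of an element, normalizing whitespace."""
    return " ".join(t.strip() for t in elem.itertext() if t.strip())


# Invented sample page, written as well-formed XML so ElementTree can parse it.
PAGE = """<html><body>
<div class="nav">Home | About</div>
<div class="article"><h1>Title</h1><p>Body text.</p><div class="ad">Buy now</div></div>
</body></html>"""

clipped = [text_of(e) for e in clip(ET.fromstring(PAGE), [".article"], [".ad"])]
print(clipped)
```

Here Select keeps the article element and everything inside it, and Unselect then removes the advertisement nested within it, so the navigation bar and the advertisement are both absent from the result.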

Relationships

Name Description
success Successfully processed FlowFiles are routed to this relationship.
failure FlowFiles that have an invalid or unknown format are routed to this relationship.
extracted Child documents extracted from an HTML document. This relationship receives documents when you set options such as ChildDocumentSelector in the configuration passed to WKOOP.