wkoop_html_processing
The function wkoop_html_processing processes HTML from the file associated with a document.
NOTE: This function does not create or return child documents.
You can use this function to:
- extract all of the links from the HTML file and add them to the document metadata.
- extract metadata from the HTML <meta...> tags.
- clip the HTML page (remove unwanted parts of the page).
- extract the text from the HTML page and use it to populate the document's content field.
Syntax
wkoop_html_processing(document, section, params)
Arguments
Argument | Description |
---|---|
document | (LuaDocument) The document that has an associated HTML file. |
section | (string) The name of a section in the CFS configuration file that contains WKOOP HTML processing settings. |
params | (table) A table of named parameters to configure WKOOP HTML processing. The table maps parameter names (String) to parameter values. For information about the parameters that you can set, see the following table. |
Named Parameters
Named Parameter | Description | Configuration Parameter |
---|---|---|
section | (string) The name of a section in the CFS configuration file. If you set this then any parameters not set in the parameters table are read from this section of the configuration file. | |
url | (string, default data:text/html) The URL for the HTML page contained in the HTML file. This URL is used to resolve relative URLs. | Url |
clipping_mode | (string) Specifies the clipping mode. Clipping removes uninteresting parts of a page. You can set this parameter to "NONE", "SMARTPRINT", "CSSCLIPPING", or "READABILITY". | ClippingMode |
clip_page_using_css_select | (string) A CSS selector to specify the parts of a page to keep when the page is clipped with CSS clipping. | ClipPageUsingCssSelect |
clip_page_using_css_unselect | (string) A CSS selector to specify the parts of a page to remove when the page is clipped with CSS clipping. | ClipPageUsingCssUnselect |
wkoop_path | (string, default WKOOP.exe) The path to the WKOOP executable file. | WKOOPPath |
temp_directory | (string, default temp) The path of the temporary directory to use for the task. | |
extract_links | (Boolean, default TRUE) Specifies whether to extract links from the page and add them to the document metadata. | ExtractLinks |
extract_html_meta | (Boolean, default TRUE) Specifies whether to extract information from the meta tags in HTML documents. An example meta tag is <meta name="..." content="..." />. The name attribute is used as the document field name and the content attribute is used as the field value. | ExtractHtmlMeta |
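As an illustration of the named parameters above, the following sketch builds a params table that clips a page with CSS selectors. The helper name build_css_clipping_params, the URL, and the selectors are hypothetical placeholders, and passing a Lua boolean for extract_links is an assumption rather than documented behavior. Any parameter left out of the table would be read from the named configuration section, as described for the section parameter.

```lua
-- Hypothetical helper (not part of CFS): builds a params table for CSS clipping.
function build_css_clipping_params(page_url, keep_selector, remove_selector)
    return {
        url = page_url,                                  -- used to resolve relative URLs
        clipping_mode = "CSSCLIPPING",
        clip_page_using_css_select = keep_selector,      -- parts of the page to keep
        clip_page_using_css_unselect = remove_selector,  -- parts of the page to remove
        extract_links = false,  -- assumption: a Lua boolean is accepted here
    }
end

function handler(document)
    local params = build_css_clipping_params(
        "http://www.example.com/news/article.html",  -- placeholder URL
        "div.article",                               -- placeholder selector
        "div.advert")                                -- placeholder selector
    -- Parameters that are not in the table are read from the named section.
    wkoop_html_processing(document, "HtmlProcessingSettings", params)
    return true
end
```

Keeping the table construction in a separate function makes it easy to reuse the same clipping settings for several document types.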
Returns
(String table). The text (or clipped text) that is extracted from the HTML is saved to a file. The Lua function returns a table that maps the type of output to the output file path. The type of output can be either txt or clipped_txt, depending on whether you have enabled clipping. CFS removes the file after the document has been processed, so do not attempt to reference the file outside of the handler function.
You can write your Lua script so that it uses the path returned in the table to read the file and add the text to the document content. In this case you might also want to set the field AUTN_NO_FILTER, so that the document is not processed by KeyView.
Extracted links and metadata are added directly to the document metadata.
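Because the key in the returned table depends on whether clipping is enabled, a script might look up both keys before reading the file. The following is a minimal sketch of that pattern. The helper names pick_output_path, read_whole_file, and apply_extracted_text are hypothetical and not part of CFS; the sketch assumes the document object provides the setContent and addField methods.

```lua
-- Hypothetical helper: returns the output path, preferring clipped text.
function pick_output_path(results)
    return results["clipped_txt"] or results["txt"]
end

-- Hypothetical helper: reads a whole file, or returns nil and an error message.
function read_whole_file(path)
    local fh, err = io.open(path, "r")
    if not fh then
        return nil, err
    end
    local content = fh:read("*a")
    fh:close()
    return content
end

-- Copies the extracted text into the document content.
-- Returns false if no output file is listed or the file cannot be read.
function apply_extracted_text(document, results)
    local path = pick_output_path(results)
    if not path then
        return false
    end
    local content = read_whole_file(path)
    if not content then
        return false
    end
    document:setContent(content)
    -- Set AUTN_NO_FILTER so that KeyView does not process the document.
    document:addField("AUTN_NO_FILTER", "SET")
    return true
end
```

A handler would call apply_extracted_text(document, results) immediately after wkoop_html_processing returns, while the output file still exists.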
Example
The following Lua script processes HTML using the settings in the [HtmlProcessingSettings] section of the CFS configuration file.
function handler(document)
    local results = wkoop_html_processing(document, "HtmlProcessingSettings", { section="HtmlProcessingSettings" })
    -- The key is clipped_txt when clipping is enabled, and txt otherwise.
    local txtFile = results["clipped_txt"] or results["txt"]
    local fh = txtFile and io.open(txtFile, "r")
    if fh then
        local file_content = fh:read("*all")
        fh:close()
        document:setContent(file_content)
        -- Set AUTN_NO_FILTER so that KeyView does not process the document.
        document:addField( "AUTN_NO_FILTER", "SET" )
    end
    return true
end