wkoop_html_processing

The function wkoop_html_processing processes HTML from the file associated with a document.

You can use this function to:

  • extract all of the links from the HTML file and add them to the document metadata.
  • extract metadata from the HTML <meta...> tags.
  • clip the HTML page (remove unwanted parts of the page).
  • extract the text from the HTML page and use it to populate the document's content field.

Syntax

wkoop_html_processing(document, section, params)

Arguments

document (LuaDocument)
    The document that has an associated HTML file.

section (string)
    The name of a section in the CFS configuration file that contains WKOOP HTML processing settings.

params (table)
    A table of named parameters to configure WKOOP HTML processing. The table maps parameter names (strings) to parameter values. For information about the parameters that you can set, see Named Parameters.

Named Parameters

section (string)
    The name of a section in the CFS configuration file. If you set this parameter, any parameters that are not set in the parameters table are read from this section of the configuration file.
    Configuration parameter: (none)

url (string, default data:text/html)
    The URL for the HTML page contained in the HTML file. This URL is used to resolve relative URLs.
    Configuration parameter: Url

clipped (Boolean, default FALSE)
    Specifies whether to clip HTML pages. Clipping removes uninteresting parts of a page. To specify the parts of pages to keep and remove, set the parameters clip_page_using_css_select and clip_page_using_css_unselect. If you do not set these parameters, WKOOP HTML processing uses an algorithm to decide which parts of the page to keep.
    Configuration parameter: Clipped

clip_page_using_css_select (string)
    A comma-separated list of CSS2 selectors to specify the parts of a page to keep when the page is clipped.
    Configuration parameter: ClipPageUsingCssSelect

clip_page_using_css_unselect (string)
    A comma-separated list of CSS2 selectors to specify the parts of a page to remove when the page is clipped.
    Configuration parameter: ClipPageUsingCssUnselect

wkoop_path (string, default WKOOP.exe)
    The path to the WKOOP executable file.
    Configuration parameter: WKOOPPath

temp_directory (string, default temp)
    The path of the temporary directory to use for the task.
    Configuration parameter: (none)

extract_links (Boolean, default TRUE)
    Specifies whether to extract links from the page and add them to the document metadata.
    Configuration parameter: ExtractLinks

extract_html_meta (Boolean, default TRUE)
    Specifies whether to extract information from the meta tags in HTML documents. An example meta tag is <meta name="..." content="..." />. The name attribute is used as the document field name and the content attribute is used as the field value.
    Configuration parameter: ExtractHtmlMeta
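
The named parameters can also be supplied directly from the script. The following is a minimal sketch, not a definitive implementation: build_clip_params is a hypothetical helper, the URL and CSS selectors are placeholder values, and it assumes that Boolean parameters can be passed as Lua true and false.

```lua
-- Hypothetical helper: builds a params table that enables clipping.
-- The selectors and the URL passed in are illustrative placeholders.
function build_clip_params(page_url)
    return {
        url = page_url,                                    -- used to resolve relative URLs
        clipped = true,                                    -- assumption: Lua Booleans are accepted
        clip_page_using_css_select = "#main, article",     -- parts of the page to keep
        clip_page_using_css_unselect = "nav, footer",      -- parts of the page to remove
        extract_links = true,
        extract_html_meta = false,
    }
end

function handler(document)
    local params = build_clip_params("http://www.example.com/news/index.html")
    -- The section name is the one used in the Example; it is an assumption here.
    local results = wkoop_html_processing(document, "HtmlProcessingSettings", params)
    return results ~= nil
end
```

Because this table does not set the section parameter, the sketch does not rely on the configuration file to supply any values that the table omits.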

Returns

(table of strings). The text (or clipped text) that is extracted from the HTML is saved to a file. The function returns a table that maps the type of output to the path of the output file. The type of output is clipped_txt if you have enabled clipping, and txt otherwise. CFS removes the file after the document has been processed, so do not attempt to reference the file outside of the handler function.
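
A script does not always know in advance whether clipping was enabled, for example when the setting comes from the configuration file. As a minimal sketch, the hypothetical helper select_output (not part of the CFS API) checks both keys of the returned table and prefers the clipped output:

```lua
-- Hypothetical helper: returns the output type and file path from the
-- table returned by wkoop_html_processing, preferring clipped text.
-- Returns nil, nil if neither key is present.
function select_output(results)
    for _, kind in ipairs({ "clipped_txt", "txt" }) do
        local path = results and results[kind]
        if path then
            return kind, path
        end
    end
    return nil, nil
end
```

In a handler, this might be called as local kind, path = select_output(results), and the file at path read before the handler returns.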

You can write your Lua script so that it uses the path returned in the table to read the file and add the text to the document content. In this case you might also want to set the field AUTN_NO_FILTER, so that the document is not processed by KeyView.

Extracted links and metadata are added directly to the document metadata.
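
Because meta tag names become document field names, a script can read the extracted values after the call. The following sketch assumes that the LuaDocument method getFieldValue is available and returns nil or an empty string for a missing field. The helper summarize_meta and the field names in the usage note are hypothetical.

```lua
-- Hypothetical helper: collects the values of the named metadata fields
-- into a single "name=value; name=value" string, skipping empty fields.
function summarize_meta(document, names)
    local parts = {}
    for _, name in ipairs(names) do
        local value = document:getFieldValue(name)
        if value and value ~= "" then
            parts[#parts + 1] = name .. "=" .. value
        end
    end
    return table.concat(parts, "; ")
end
```

For a page that contains <meta name="description" content="..." />, a call such as summarize_meta(document, { "description", "keywords" }) would include the description value if extract_html_meta is enabled.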

Example

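The following Lua script processes HTML using the settings in the [HtmlProcessingSettings] section of the CFS configuration file. It then reads the extracted text from the output file, sets it as the document content, and sets AUTN_NO_FILTER.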

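function handler(document)
    -- Settings are read from [HtmlProcessingSettings] in the CFS configuration file.
    local results = wkoop_html_processing(document, "HtmlProcessingSettings",
       { section="HtmlProcessingSettings" })

    -- Use the clipped text if it is available, otherwise the unclipped text.
    local textFile = results and (results["clipped_txt"] or results["txt"])
    if not textFile then return true end

    -- Read the extracted text; leave the document unchanged if the file cannot be opened.
    local fh = io.open(textFile, "r")
    if not fh then return true end
    local file_content = fh:read("*a")
    fh:close()

    -- Use the extracted text as the document content and set AUTN_NO_FILTER.
    document:setContent(file_content)
    document:addField( "AUTN_NO_FILTER", "SET" )
    return true
end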