Scripted Processing

The scripted processing features of IDOL Web Connector allow you to control the embedded browser while specific pages are being processed. The browser is controlled by means of a custom Lua script.

A common feature on many web sites is paging. Imagine a site presents a list of items but there are too many to show on a single page, so the site presents the first ten and provides a link to the next page. When you click "next", some web sites reload the page. Your browser sends a new request to the web server, perhaps with an additional query parameter that specifies the current position in the list, so that the server returns the next ten items. After clicking "next" a few times, you might see page URLs like:

http://www.example.com/articles.htm
http://www.example.com/articles.htm?start=10
http://www.example.com/articles.htm?start=20
http://www.example.com/articles.htm?start=30

The IDOL Web Connector treats each of these URLs as a separate page, so it crawls each one for links. The connector follows the link to the next page, and the following one, and so on, creating a separate IDOL document for each page.

Some web sites take a different approach and present the next "page" of content without reloading the page. The "next" button probably isn't a hyperlink at all. When you click the button, an event handler might run some JavaScript that performs an HTTP request and inserts new content into the existing page. No navigation takes place and the page URL does not change. Sites like these are often described as using Asynchronous JavaScript and XML (AJAX) techniques.

On sites like these, automated web crawlers might fail to retrieve all of the content. You can use the scripted processing features of IDOL Web Connector to simulate clicking the "next" button and instruct the connector to ingest the page after each click.

To configure scripted processing, set the configuration parameters ScriptedProcessingUrlRegex and ScriptedProcessingLuaScript. ScriptedProcessingUrlRegex identifies the pages on which to use scripted processing. ScriptedProcessingLuaScript specifies the path of a Lua script to use for controlling the embedded browser. For example:

[MyTask]
ScriptedProcessingSections=ScriptedProcessing

[ScriptedProcessing]
ScriptedProcessingUrlRegex=.*/js_paged_index.html
ScriptedProcessingLuaScript=js_paged_index.lua

The Lua script that you write must define a function with the signature processPage(url, session). The url argument is a string that contains the URL of the current page. The session argument is a LuaClientSession object that provides methods for interacting with the page.
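
As a minimal sketch of the required entry point, the following script ingests the page once without interacting with it. It uses only the snapshotPage method that appears in the fuller example in this section; the label "Page1" is an arbitrary illustrative value.

```lua
-- Minimal sketch: the connector calls processPage for each page whose URL
-- matches ScriptedProcessingUrlRegex.
function processPage(url, session)
    -- url is a string; session is the LuaClientSession for the loaded page.
    -- Ingest the page as currently rendered. "Page1" is an illustrative label.
    session:snapshotPage("Page1")
end
```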

The following example implementation performs these steps:

  1. Creates a variable to contain the page number and sets this to 1.
  2. Ingests the page.
  3. Enters a loop that performs the following steps until the "next" button is disabled:

    1. Increments the page number.
    2. Clicks the "next" button.
    3. Waits until downloads have finished and the DOM stops changing.
    4. Ingests the page.
-- CSS selector for the "next" button, and the timings (in milliseconds) passed to waitForQuiet.
g_nextButtonCssSelector = "#next-button"
g_quietIntervalMs = 250
g_quietTimeoutMs = 10000

-- Returns true while the page contains a "next" button that is not disabled.
function nextPageButtonActive(session)
    local elementsCount = session:countElements(g_nextButtonCssSelector .. ":not([disabled])")
    return elementsCount > 0
end

function processPage(url, session)
    local taskLog = get_task_log()
    taskLog:full("Scripted Processing: " .. url)

    local pageNumber = 1

    -- Ingest the page as first loaded.
    taskLog:full("Snapshotting Page: " .. tostring(pageNumber))
    session:snapshotPage("Page" .. tostring(pageNumber))

    -- Click "next" and ingest again until the button is disabled.
    while nextPageButtonActive(session) do
        pageNumber = pageNumber + 1
        session:clickElement(g_nextButtonCssSelector)
        -- Wait until downloads have finished and the DOM stops changing.
        session:waitForQuiet(g_quietIntervalMs, g_quietTimeoutMs)

        taskLog:full("Snapshotting Page: " .. tostring(pageNumber))
        session:snapshotPage("Page" .. tostring(pageNumber))
    end

    taskLog:full("Scripted Processing complete: " .. tostring(pageNumber) .. " pages")
end
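
The loop in the example ends only when the "next" button becomes disabled. As a hedged variation, a script could also cap the number of pages so that a site whose button never disables cannot keep the loop running indefinitely. The sketch below uses only the session methods shown in the example; the helper snapshotAllPages and the limit maxPages are hypothetical names introduced for illustration and are not connector settings.

```lua
local nextButtonCssSelector = "#next-button"
local quietIntervalMs = 250
local quietTimeoutMs = 10000
local maxPages = 50  -- hypothetical safety limit, not a connector parameter

-- Hypothetical helper: ingests the current page, then clicks "next" and
-- ingests again until the button is disabled or the limit is reached.
-- Returns the number of pages ingested.
function snapshotAllPages(session, limit)
    local pageNumber = 1
    session:snapshotPage("Page" .. tostring(pageNumber))

    while pageNumber < limit
        and session:countElements(nextButtonCssSelector .. ":not([disabled])") > 0 do
        pageNumber = pageNumber + 1
        session:clickElement(nextButtonCssSelector)
        session:waitForQuiet(quietIntervalMs, quietTimeoutMs)
        session:snapshotPage("Page" .. tostring(pageNumber))
    end

    return pageNumber
end

function processPage(url, session)
    snapshotAllPages(session, maxPages)
end
```

Keeping the loop in a separate helper makes it possible to exercise the paging logic with a stand-in session object outside the connector.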