Use the IDOL Web Connector
The best way to ingest web content into IDOL is to use the IDOL Web Connector. As you might expect, this is one of the most widely used IDOL connectors. Read on for a step-by-step guide that demonstrates how to ingest information from a real site - Wikipedia.
Getting Started
Web Connector is included with IDOL versions 11.0 and later (but is not available on Solaris). If you have installed IDOL on Windows or Linux you might already have Web Connector.
Like all IDOL components, you configure the Web Connector by editing its configuration file. The configuration file is named webconnector.cfg and is located in the webconnector folder of your IDOL installation.
Open the configuration file and find the [License] section. Check that the LicenseServerHost and LicenseServerACIPort parameters are set to the host name and ACI port of your IDOL License Server.
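For example, the section might look something like the following sketch. The host name and port shown here are placeholders, so substitute the values for your own License Server:

[License]
// Replace with the host name and ACI port of your License Server
LicenseServerHost=license-server.example.com
LicenseServerACIPort=20000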
Next, scroll down to the [FetchTasks] section. Rename the default fetch task from [MyTask] to [Wikipedia], because that is the site we want to crawl. You can give the task any name, but the name is used in action commands and responses and is written to the logs, so choose one that clearly describes the task.
[FetchTasks]
SSLMethod=SSLV23
Number=1
0=Wikipedia

[Wikipedia]
The [Wikipedia] section is where you will add the configuration parameters that set up the task.
In an enterprise environment, you have probably installed the connector on a machine that is behind a proxy server. The connector has to send requests to the proxy, which makes them on behalf of the connector machine and returns the results. The easiest way to set the address of the proxy server is with the ProxyHost and ProxyPort parameters.
ProxyHost=proxy.example.com
ProxyPort=8080
Your environment might provide a proxy automatic configuration (.pac) script, especially if your organization has complex requirements and many proxy servers. Instead of setting ProxyHost and ProxyPort, configure the connector to use the script by setting the parameter ProxyAutoConfigurationUrl.
ProxyAutoConfigurationUrl=http://proxy.example.com/proxy.pac
Choose the Starting Point
The Web Connector can crawl a web site in several ways:
- The connector can start from a page and follow links to other pages, until it has exhausted all the links that satisfy your criteria. You must specify the URL of the page to start from using the Url parameter.
- The connector can ingest all the pages on a sitemap. A sitemap is an XML document that contains a list of pages for web crawlers to retrieve. Using a sitemap is often the best option, if there is one, because the connector retrieves the pages suggested by the site administrator. A sitemap might include pages that are not accessible by following links, and it also makes the fetch task easier to configure. To ingest a site in this way, you must specify the URL of the sitemap using the SitemapUrl parameter (a minimal sketch follows this list).
- The connector can ingest every page from a list of URLs. You must specify the URLs. This is not practical for large sites such as Wikipedia.
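If the site you want to crawl publishes a sitemap and you prefer that approach, the task uses SitemapUrl instead of Url. The sitemap address below is only a placeholder, not a real Wikipedia sitemap:

[MySitemapTask]
// Example only: replace with the sitemap URL published by the site you are crawling
SitemapUrl=https://www.example.com/sitemap.xml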
In this case we will configure the connector to follow links. Index pages are usually a good starting point, but entries in an encyclopedia contain many links to other entries, so you can probably start from any page and, given sufficient time, the connector will crawl the entire site.
Choose a page to start from and specify the URL by setting the Url parameter.
Url=https://en.wikipedia.org/wiki/JSON
Wikipedia contains links to external sites, so unless you want to start crawling the entire web it is a good idea to restrict the connector to just one site.
StayOnSite=TRUE
When StayOnSite=TRUE, the connector does not follow links to other sites, including other sub-domains of the same site. For example, the connector does not follow links to https://fr.wikipedia.org.
Select the Content to Crawl and Ingest
One of the most important (and difficult) parts of configuring your Web Connector is selecting what content to crawl and ingest into IDOL.
When you visit a website, you can quickly focus on whatever parts of the site you consider important, and ignore the rest. There are pages on Wikipedia that explain how to use the site or describe the history of the entries in the encyclopedia. It is best not to ingest these pages because the content will pollute the results of queries, and affect other IDOL operations that your users run.
There are also pages that you should prevent the connector from interacting with. It is important to be careful when crawling a site, because the connector can follow every hyperlink on a page. For example, if you are crawling Wikipedia you do not want the connector to follow "Edit" links and edit pages. If you are crawling an e-commerce site to run sentiment analysis on product reviews, you do not want the connector to follow the "Add to cart" links and try to buy everything.
Fortunately, the Web Connector has configuration parameters that you can set to meet these requirements. The robot exclusion standard is used by many web sites to instruct web crawlers to ignore certain pages. The connector follows this standard by default, but you can change the behavior by setting the parameter FollowRobotProtocol. The rules defined in a site's robots.txt might already exclude some of the pages that you want to avoid, and following the standard is recommended in most cases.
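For this task we keep the default behavior, and state it explicitly so the intent is clear when you read the configuration:

FollowRobotProtocol=TRUE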
In most cases you also need to exclude further pages, which you can do using the parameters SpiderUrlMustHaveRegex and SpiderUrlCantHaveRegex. The connector only crawls and ingests a page if its full URL matches the regular expression set by the parameter SpiderUrlMustHaveRegex, and does not match the regular expression set by the parameter SpiderUrlCantHaveRegex.
It is often much easier to set these parameters if you consult with the administrator of the site that you are ingesting. They might be able to reveal an easy way to filter out the pages that you do not want.
Here are the URLs of some example entries on Wikipedia:
https://en.wikipedia.org/wiki/JSON
https://en.wikipedia.org/wiki/JavaScript
https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
https://en.wikipedia.org/wiki/Array_data_structure
These URLs are examples of pages that you want to exclude:
https://en.wikipedia.org/w/index.php?title=List_of_HTTP_header_fields&action=edit
https://en.wikipedia.org/w/index.php?title=Computer&action=history
https://en.wikipedia.org/wiki/Help:Contents
https://en.wikipedia.org/wiki/Wikipedia:About
The URLs for the encyclopedia entries all look similar, so you can write a regular expression that matches these but does not match links to the edit and history pages. However, some of the other pages that we want to ignore have URLs that are similar to the encyclopedia articles. In this case, disallowing the colon character (:) should exclude these pages. There are regular expression testers on the web that you can use to help build and test your regular expressions.
SpiderUrlMustHaveRegex=https://en\.wikipedia\.org/wiki/[A-Za-z0-9_]*
This regular expression is relatively restrictive, but it is much safer to start with something like this and miss pages that you want to ingest, than to start with a regular expression that matches too much. With the SpiderUrlMustHaveRegex parameter set in this way, the connector only crawls and ingests pages with URLs that start with https://en.wikipedia.org/wiki/, followed by zero or more characters that match an upper or lower case letter, a number, or an underscore.
You can use the parameters LinkElementMustHaveRegex and LinkAttributes to follow only certain types of hyperlinks (for example <a href="...">) in the page source. LinkElementMustHaveRegex is set as a regular expression and LinkAttributes accepts a comma-separated list of attribute names.
LinkElementMustHaveRegex=^a$
LinkAttributes=href
The connector can also accept or reject pages based on the Content-Type header that is returned with the page. This is a good way to filter out certain types of content, such as scripts. The ContentTypeCantHaveRegex and ContentTypeMustHaveRegex parameters are also set as regular expressions. The following example is a good starting point that attempts to filter out stylesheets, XML, and JavaScript files:
ContentTypeCantHaveRegex=(application|text)/(javascript|xml|x-javascript|css)(;.*)?
The connector restricts the pages that are ingested based on page size and number of links, but some of the encyclopedia entries are quite long and link to many other pages. So, in this case, remove the limits on page size and the maximum number of links by setting these parameters to zero:
MinPageSize=0
MaxPageSize=0
MaxLinksPerPage=0
It is usually a good idea to remove the content of <script> and <noscript> elements from pages before they are ingested. This does not affect the behavior of Web Connector as it crawls the site, but removing this content improves the quality of the information that is indexed into IDOL.
RemoveNoscripts=TRUE
RemoveScripts=TRUE
Removing <script> elements improves the quality of the information because scripts do not contain any conceptual information that IDOL can use. Your end-users are interested in the content of a page, not the scripts used to generate it. The <noscript> element might contain duplicate content or just a statement such as "Your browser does not support JavaScript".
For a full list of the configuration parameters you can use to select or reject content, refer to the Web Connector Reference.
Consider Other Users
Before crawling a Web site, consider the impact the connector will have. Avoid putting too much load on the target site, because this could degrade the experience of other users. If you ingest content too rapidly, the site administrator might decide to block your requests in future - and if your connector makes requests through a company proxy, your entire organization could be blocked.
By default, the connector uses five threads for running a synchronize task. This means that it can request five pages at once. It also processes those pages and follows links much faster than a human visitor. The number of threads to use is specified by the parameter SynchronizeThreads. If you want the connector to wait between requests, you can set the PageDelay parameter. For example, setting PageDelay=1s leaves a second between each request, but because the connector is using five threads, this results in up to five requests per second.
SynchronizeThreads=5
PageDelay=1s
Set Parameters to Help in Testing
You can add some parameters to the task configuration to help as you set up the connector.
You probably do not want to wait for the connector to crawl all of Wikipedia just to see whether the configuration is reasonable, so restrict the number of pages that are processed by setting the parameters Depth and MaxPages. Depth specifies the number of links to follow from the starting point. For example, if you set Depth=2, the connector can ingest any page that can be reached by clicking two links from the starting point. MaxPages specifies the maximum number of pages to ingest each time you run the synchronize task.
Depth=2
MaxPages=50
It is also useful to see the pages that the connector is ingesting. You can achieve this by setting IngestKeepFiles=TRUE. This instructs the connector to keep the pages it has downloaded in its temporary folder, rather than deleting them as soon as they have been ingested.
You probably do not want to keep track of the pages that are processed while you are optimizing the configuration, so set SynchronizeKeepDatastore=FALSE, which instructs the connector to delete the datastore at the end of the synchronize task. In a production environment, set this to TRUE, so that the connector keeps track of the work it has done and does not ingest the same pages again unless they have changed.
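Taken together, the two testing-related settings described above look like this:

IngestKeepFiles=TRUE
SynchronizeKeepDatastore=FALSE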
You can configure logging for Web Connector in the same way as other IDOL components, by modifying the settings in the [Logging] section of the configuration file. The connector writes messages about the pages it crawls to the synchronize log stream. The default configuration saves these messages to a file named synchronize.log, in the logs directory. While you are configuring the connector it helps to have as much information as possible, so set the parameter LogLevel to Full.
[Logging]
LogLevel=Full
This produces verbose logs. In production you can optimize performance by reducing the log level to Normal.
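For example, a production logging section might simply be:

[Logging]
LogLevel=Normal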
Check Your Configuration
The completed fetch task now looks something like this:
[Wikipedia]
ProxyHost=proxy.example.com
ProxyPort=8080
Url=https://en.wikipedia.org/wiki/JSON
StayOnSite=TRUE
FollowRobotProtocol=TRUE
SpiderUrlMustHaveRegex=https://en\.wikipedia\.org/wiki/[A-Za-z0-9_]*
LinkElementMustHaveRegex=^a$
LinkAttributes=href
ContentTypeCantHaveRegex=(application|text)/(javascript|xml|x-javascript|css)(;.*)?
MinPageSize=0
MaxPageSize=0
MaxLinksPerPage=0
RemoveNoscripts=TRUE
RemoveScripts=TRUE
SynchronizeThreads=5
PageDelay=1s
// Increase or remove the depth limit before going into production
Depth=2
// Remove these settings before going into production
MaxPages=50
IngestKeepFiles=TRUE
SynchronizeKeepDatastore=FALSE
Start the Connector
The next step is to start the connector. Start the connector in the same way as other IDOL components, by running the executable file or starting its service. For more information about starting and stopping IDOL components, refer to the IDOL Getting Started Guide.
Unless you have altered the default schedule defined in the configuration file, the connector starts running the task immediately. If you have disabled scheduled tasks, start the task using the fetch action:
http://host:7006/action=fetch&fetchaction=synchronize&tasksections=Wikipedia
To discover the status of the task, use the QueueInfo action:
http://host:7006/action=queueinfo&queueaction=getstatus&queuename=fetch
When the task has finished the response looks similar to this:
<autnresponse>
<action>QUEUEINFO</action>
<response>SUCCESS</response>
<responsedata>
<actions>
<action>
<status>Finished</status>
<queued_time>2016-Nov-18 14:22:27</queued_time>
<time_in_queue>0</time_in_queue>
<process_start_time>2016-Nov-18 14:22:27</process_start_time>
<time_processing>94</time_processing>
<process_end_time>2016-Nov-18 14:24:01</process_end_time>
<documentcounts>
<documentcount added="50" errors="0" ingestadded="50" seen="58" task="WIKIPEDIA"/>
</documentcounts>
<fetchaction>SYNCHRONIZE</fetchaction>
<tasks>
<success>WIKIPEDIA</success>
</tasks>
<tasksection>Wikipedia</tasksection>
<token>...</token>
</action>
</actions>
</responsedata>
</autnresponse>
The task has ingested 50 pages because we set MaxPages=50.
Optimize the Configuration
After the fetch task has finished, stop the connector and examine the synchronize log. You can find the log files in the connector's logs directory.
Search the synchronize log for WKOOP:Loading Page to find the pages the connector has processed. These messages are logged only when LogLevel=Full.
WKOOP:Loading Page: https://en.wikipedia.org/wiki/JSON
WKOOP:Loading Page: https://en.wikipedia.org/wiki/Array_data_structure
WKOOP:Loading Page: https://en.wikipedia.org/wiki/Base64
WKOOP:Loading Page: https://en.wikipedia.org/wiki/Computing
WKOOP:Loading Page: https://en.wikipedia.org/wiki/Database
Similarly, search for Url rejected to find the pages that the connector has rejected, and the reasons why they were rejected:
Url rejected due to robot protocol: https://en.wikipedia.org/w/index.php?title=JSON&action=edit&section=11
Url rejected due to robot protocol: https://en.wikipedia.org/w/index.php?title=Special:CreateAccount&returnto=JSON
Url rejected due to SpiderUrlMustHaveRegex: https://en.wikipedia.org/wiki/Category:JSON
Url rejected due to SpiderUrlMustHaveRegex: https://en.wikipedia.org/wiki/Atom_(standard)
Url rejected due to SpiderUrlMustHaveRegex: https://en.wikipedia.org/wiki/Attribute–value_pair
We want to reject some of these pages, but the last two have been rejected unnecessarily. At this point you can go back to the configuration file and modify the task, particularly the parameters that specify which pages to crawl and ingest. For example, you might allow additional characters at the end of the URL (after /wiki/), but still exclude the colon character:
SpiderUrlMustHaveRegex=https://en\.wikipedia\.org/wiki/[^:]*
If you start the connector and run the task again, you should find that the connector processes pages that were rejected before. The following page was rejected because its URL contains parentheses, but it is now crawled and ingested:
WKOOP:Loading Page: https://en.wikipedia.org/wiki/Atom_(standard)
In the task configuration, we set IngestKeepFiles=TRUE, so look in the connector's temp directory to see the pages that were downloaded.
Conclusion
Hopefully this guide has provided some useful insights into a real configuration. The [FetchTasks] and task sections from the completed configuration are included below:
[FetchTasks]
SSLMethod=SSLV23
Number=1
0=Wikipedia

[Wikipedia]
ProxyHost=proxy.example.com
ProxyPort=8080
Url=https://en.wikipedia.org/wiki/JSON
StayOnSite=TRUE
FollowRobotProtocol=TRUE
SpiderUrlMustHaveRegex=https://en\.wikipedia\.org/wiki/[^:]*
LinkElementMustHaveRegex=^a$
LinkAttributes=href
ContentTypeCantHaveRegex=(application|text)/(javascript|xml|x-javascript|css)(;.*)?
MinPageSize=0
MaxPageSize=0
MaxLinksPerPage=0
RemoveNoscripts=TRUE
RemoveScripts=TRUE
SynchronizeThreads=5
PageDelay=1s
Depth=5
In the finished configuration, the Depth parameter has been increased to 5. The MaxPages parameter has been removed, so the connector uses the default maximum of 2000 pages. If you are crawling a large site and want to retrieve every page, you can set this parameter to a higher value.
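For example, to crawl a large site more completely you could add something like the following. The value shown is only an illustration; choose one that suits the size of the site:

// Illustrative value only
MaxPages=100000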
There are several things you could do to make this configuration even better, for example:
- Configure clipping, to improve the quality of the information that the connector indexes into IDOL.
- Configure thumbnail rendering, so that front-end applications can display a thumbnail image of the page when displaying query results.