Authenticate with a Web Site

Some Web sites require that you log on before you can access content. To retrieve content from these Web sites, configure the connector to log on to the site.

Basic, Digest, and NTLMv2

To log on to a Web site that uses HTTP Basic, HTTP Digest, or NTLM version 2 authentication, set the configuration parameters AuthUser and AuthPassword. You can also set AuthUrlRegex if you want to restrict authentication to specific URLs. For example:

[MyTask]
Url=https://www.example.com/
AuthUrlRegex=.*www\.example\.com/.*
AuthUser=user
AuthPassword=password

To set the authentication parameters in a different section of the configuration file, set the AuthSection parameter to the name of that section. For example:

[MyTask]
Url=https://www.example.com/
AuthSection=MyAuthSection

[MyAuthSection]
AuthUrlRegex=.*www\.example\.com/.*
AuthUser=user
AuthPassword=password

The AuthSection parameter accepts multiple values, which you can specify by using numbered parameters. This is useful when a task crawls more than one site and needs more than one set of credentials. In each section, set the AuthUrlRegex parameter to specify the URLs that the credentials apply to. For example:

[MyTask]
Url=https://www.example.com/
AuthSection0=AuthExample
AuthSection1=AuthExampleSubDomain

[AuthExample]
AuthUrlRegex=.*www\.example\.com/.*
AuthUser=MyUsername
AuthPassword=MyP4ssw0rd

[AuthExampleSubDomain]
AuthUrlRegex=.*subsite\.example\.com/.*
AuthUser=AnotherUsername
AuthPassword=AnotherP4ssw0rd
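
Before you run the task, it can help to check which credential section a given URL would fall under. The following Python sketch is not part of the connector. The AUTH_SECTIONS dictionary and the matching_sections function are hypothetical names, and the sketch assumes that an AuthUrlRegex value must match the entire URL, which might differ from how the connector applies the expression.

```python
import re

# Hypothetical mirror of the AuthUrlRegex values in the
# [AuthExample] and [AuthExampleSubDomain] sections.
AUTH_SECTIONS = {
    "AuthExample": r".*www\.example\.com/.*",
    "AuthExampleSubDomain": r".*subsite\.example\.com/.*",
}


def matching_sections(url, sections=AUTH_SECTIONS):
    """Return the names of the sections whose pattern matches the whole URL."""
    # Assumption: the pattern has to match the entire URL (re.fullmatch).
    return [
        name
        for name, pattern in sections.items()
        if re.fullmatch(pattern, url)
    ]
```

With these patterns, https://www.example.com/docs/ selects AuthExample, while https://www.example.com (without a trailing slash) selects nothing, because both patterns require a / after the host name.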

Kerberos

The connector supports Kerberos authentication on both Windows and Linux.

NOTE: Kerberos authentication is not supported on the RHEL7FIPS platform.

To configure Kerberos authentication (Web Connector on Windows)

  • Set the AuthUser and AuthPassword configuration parameters in the task section of your configuration file.

To configure Kerberos authentication (Web Connector on Linux)

  1. On a Windows server, generate a .keytab file for the domain user that you want to authenticate as. Run the following command, where <user> is the user name, <domain> is the domain, and <password> is the password:

    ktpass -out <user>.keytab -princ <user>@<domain> -mapUser <user> -mapOp set -pass <password> -crypto all -ptype KRB5_NT_PRINCIPAL

  2. Copy the .keytab file to the Linux server that hosts Web Connector.
  3. Install the Kerberos client packages. For example, on CentOS:

    sudo yum install krb5-workstation krb5-libs

  4. Run kinit, using the .keytab file that you generated earlier.

    kinit -k -t /home/<user>/<user>.keytab -c /home/<user>/krbcache <user>@<domain>

  5. Open the file WKOOP/WKOOP.cfg, in the Web Connector installation directory, and set the parameter IWAServerAllowedList. The value of this parameter must match the server name of the site that you are indexing, for example:

    [BrowserLibrary]
    IWAServerAllowedList=*.example.com

    Alternatively, you can set IWAServerAllowedList=* to match all hosts.

  6. Open the Web Connector configuration file and ensure that the AuthUser and AuthPassword parameters are not set.
  7. Set the KRB5CCNAME environment variable to the path of the Kerberos cache that you generated, and then start Web Connector:

    KRB5CCNAME=FILE:/home/<user>/krbcache ./WebConnector.exe
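
Steps 4 and 7 can be scripted, for example so that the Kerberos cache is always created before the connector starts. The following Python sketch only assembles the same kinit command line and KRB5CCNAME value that those steps use. The function names are hypothetical, and the sketch assumes the /home/<user> paths shown in the steps.

```python
import os
import subprocess


def kinit_command(user, domain, home=None):
    """Build the kinit argument list from step 4."""
    home = home or f"/home/{user}"
    return [
        "kinit", "-k",
        "-t", f"{home}/{user}.keytab",
        "-c", f"{home}/krbcache",
        f"{user}@{domain}",
    ]


def connector_env(user, home=None, base_env=None):
    """Return a copy of the environment with KRB5CCNAME set as in step 7."""
    home = home or f"/home/{user}"
    env = dict(os.environ if base_env is None else base_env)
    env["KRB5CCNAME"] = f"FILE:{home}/krbcache"
    return env


def start_connector(user, domain):
    """Run kinit, then start Web Connector from the current directory."""
    subprocess.run(kinit_command(user, domain), check=True)
    # Assumption: the current directory is the Web Connector
    # installation directory.
    subprocess.run(["./WebConnector.exe"], env=connector_env(user), check=True)
```

Calling start_connector with your own user name and domain runs both commands in order and stops if kinit fails.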

Submit a Form

If a Web site presents a log-on form, the connector might be able to log on by submitting that form.

Configure the connector to submit a form by setting the following configuration parameters:

  • CookieURLs: A comma-separated list of URLs to visit to obtain cookies. Some sites redirect the connector to the login form, and then back to the target page after the form has been submitted. If the site does not do this, you can set this parameter to ensure the connector authenticates successfully.
  • FormUrlRegex: A regular expression to identify the page that contains the log-on form. The connector does not attempt to submit form data unless the URL of a page matches the regular expression.
  • InputSelector: A list of CSS selectors to identify the form fields to populate. Specify the selectors in a comma-separated list or by using numbered parameters.
  • InputValue: The values to use for the form fields specified by the InputSelector parameter. Specify the values in a comma-separated list or by using numbered parameters.
  • SubmitSelector: A CSS selector that identifies the form element to use to submit the form.
  • ValidateFormData: A Boolean value that specifies whether the connector attempts to validate the data supplied to complete a form. The connector can validate the data based on the types of the input elements.

You can set these parameters in the [TaskName] section of the configuration file, for example:

[MyTask]
Url=https://www.example.com/
CookieURLs=https://www.example.com/login.php
FormUrlRegex=.*login\.php
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MyUsername
InputValue1=MyP4ssw0rd!
SubmitSelector=input[name=login]

To specify the information in a separate section of the configuration file, set the FormsSection parameter:

[MyTask]
Url=https://www.example.com/
FormsSection=LogOnForm

[LogOnForm]
CookieURLs=https://www.example.com/login.php
FormUrlRegex=.*login\.php
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MyUsername
InputValue1=MyP4ssw0rd!
SubmitSelector=input[name=login]
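
When a form has several fields, a mismatch between the InputSelector and InputValue lists is easy to miss. The following Python sketch is not part of the connector. It reads a form section and returns the selector and value pairs. The function names numbered_values and form_fields are hypothetical, the standard configparser module only approximates the connector's own configuration parser, and the sketch assumes that the field identified by InputSelector N is populated with InputValue N.

```python
import configparser
import re

# Sample text copied from the [LogOnForm] example above.
SAMPLE = r"""
[LogOnForm]
CookieURLs=https://www.example.com/login.php
FormUrlRegex=.*login\.php
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MyUsername
InputValue1=MyP4ssw0rd!
SubmitSelector=input[name=login]
"""


def numbered_values(section, name):
    """Return the values of name0, name1, ... in index order.

    Falls back to splitting an unnumbered, comma-separated parameter.
    A selector that itself contains a comma is not handled.
    """
    found = {}
    for key, value in section.items():
        match = re.fullmatch(re.escape(name) + r"(\d+)", key)
        if match:
            found[int(match.group(1))] = value.strip()
    if found:
        return [found[i] for i in sorted(found)]
    if name in section:
        return [part.strip() for part in section[name].split(",")]
    return []


def form_fields(config_text, section_name):
    """Pair each selector with its value, or raise if the counts differ."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names exactly as written
    parser.read_string(config_text)
    section = parser[section_name]
    selectors = numbered_values(section, "InputSelector")
    values = numbered_values(section, "InputValue")
    if len(selectors) != len(values):
        raise ValueError(
            f"{len(selectors)} selectors but {len(values)} values "
            f"in [{section_name}]"
        )
    return list(zip(selectors, values))
```

Calling form_fields(SAMPLE, "LogOnForm") pairs input[name=username] with MyUsername and input[name=password] with MyP4ssw0rd!. A section that defines a different number of selectors and values raises a ValueError.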

To submit different forms during a single task, you can create multiple sections containing form settings. The FormsSection parameter accepts multiple values. In each section, use the FormUrlRegex parameter to identify the page that contains the form:

[MyTask]
Url=https://www.example.com/
FormsSection0=LogOnFormExample
FormsSection1=LogOnFormExampleSubDomain

[LogOnFormExample]
FormUrlRegex=.*www\.example\.com/.*login\.php
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MyUsername
InputValue1=MyP4ssw0rd!
SubmitSelector=input[name=login]

[LogOnFormExampleSubDomain]
FormUrlRegex=.*subsite\.example\.com/.*login\.php
InputSelector0=input[name=username]
InputSelector1=input[name=password]
InputValue0=MySubsiteUsername
InputValue1=MySubsiteP4ssw0rd!
SubmitSelector=input[name=login]