Train Passage Extractor Classifiers
The Answer Server Passage Extractor uses a question classifier to determine what type a question is, and therefore what entities (if any) to extract from candidate answers. The type refers to the type of information that the question is requesting. For example, the question "How many points make up a perfect fivepin bowling score?" is looking for a number, while the question "What is an annotated bibliography?" is looking for a description.
The question classifier is always required. The Passage Extractor system does not return any answers without it.
The Answer Server installation includes classifiers for the English and German languages. For information about configuring which classifier to use, see Configure the Passage Extractor System. If the default classifier does not perform well for your use case, or you want to use Passage Extractor with other languages, you can train your own classifier.
The following sections provide more information about how to create and train your own classifiers.
Create a Training File
To train a question classifier, first create a training file to describe the kind of question classifications that you expect to send to your Passage Extractor. Each line of the training file defines a label and an example question, in the following format:
Label;Example Question
The example questions are the training data. The label specifies the kind of information that the question is requesting. For example, the first few lines of the training file might be:
DESC:desc;What did the only repealed amendment to the U.S. Constitution deal with?
NUM:count;How many points make up a perfect fivepin bowling score?
DESC:def;What is an annotated bibliography?
NUM:date;What is the date of Boxing Day?
The default training file uses a Text Retrieval Conference (TREC) classification system to specify question classifiers. Micro Focus recommends that you use this classification system, which is based on a commonly used set. For more information, see Training File Labels. However, you can use your own classification system if required.
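Before you train, it can help to check that a training file follows the Label;Example Question format described above. The following Python sketch is not part of Answer Server. The function names are hypothetical, and the sketch assumes that the first semicolon on a line separates the label from the question and that blank lines can be ignored.

```python
from collections import Counter


def parse_training_line(line):
    """Split one 'Label;Example Question' line into (label, question)."""
    # Split on the first semicolon only (an assumption), so that any
    # semicolons inside the question text are preserved.
    label, sep, question = line.strip().partition(";")
    if not sep or not label.strip() or not question.strip():
        raise ValueError(f"Malformed training line: {line!r}")
    return label.strip(), question.strip()


def summarize_training_lines(lines):
    """Count the example questions per label, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            label, _question = parse_training_line(line)
            counts[label] += 1
    return counts


# The sample lines from this topic.
sample = [
    "DESC:desc;What did the only repealed amendment to the U.S. Constitution deal with?",
    "NUM:count;How many points make up a perfect fivepin bowling score?",
    "DESC:def;What is an annotated bibliography?",
    "NUM:date;What is the date of Boxing Day?",
]

for label, count in sorted(summarize_training_lines(sample).items()):
    print(label, count)
```

A per-label count like this makes it easy to spot labels that have very few example questions before you send the file for training.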
Train a Classifier
To train the question classifier, you use the ManageResources action, which accepts a JSON object with the details of the training file. For example:
action=ManageResources&SystemName=passageextractor&Data=JSON
Where the JSON object takes the following form:
{
  "operation": "train",
  "type": "classifier",
  "trainingfile": "classifier_training.txt",
  "savemodel": true
}
TIP: Typically, Micro Focus recommends that you send ManageResources as a POST request. For testing, you can use a GET request, in which case you must base64-encode the JSON data.
If you do not want to save the training model (for example, during testing), set savemodel to false.
NOTE: You can save classifiers (by setting savemodel to true) only if you set the ClassifierFile and LabelFile configuration parameters in your Passage Extractor system configuration. See Configure the Passage Extractor System.
The trainingfile parameter sets the location and name of a suitable training file. The training file contains a set of training questions, each with a label that specifies the sort of answer that the question is looking for (for example, a person, place, or description).
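For testing with a GET request, the JSON object must be base64 encoded before it goes into the Data parameter. The following Python sketch shows one way to build such a URL. The host name, port, and function name are hypothetical, and the sketch assumes the standard base64 alphabet with the result URL-encoded; check your Answer Server documentation for the exact encoding it expects.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_train_request(training_file, save_model,
                        base_url="http://answerserver.example:12000/"):
    """Build a ManageResources GET URL that trains a classifier.

    base_url is a placeholder (hypothetical host and port).
    """
    payload = {
        "operation": "train",
        "type": "classifier",
        "trainingfile": training_file,
        "savemodel": save_model,
    }
    # Serialize the JSON object, then base64 encode it for the GET request.
    raw = json.dumps(payload).encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    # urlencode percent-escapes the '+', '/', and '=' characters that
    # standard base64 output can contain.
    query = urlencode({
        "action": "ManageResources",
        "SystemName": "passageextractor",
        "Data": encoded,
    })
    return f"{base_url}?{query}"


# Test run that does not save the model.
url = build_train_request("classifier_training.txt", save_model=False)
print(url)

# Round-trip check: decode the Data parameter back into the JSON object.
params = parse_qs(urlsplit(url).query)
decoded = json.loads(base64.b64decode(params["Data"][0]))
print(decoded)
```

For a POST request, which is the recommended approach, the same JSON object would be sent in the request body instead.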
You can use the GetResources action to retrieve the whole JSON schema for the operation, in the same way as for Answer Bank systems. See Find the JSON Schema for Your Update.
Classifier Behavior File
In addition to the main classifier and label files, there is a classifier behavior file, which is available in the Answer Server installation.
The classifier behavior file contains details of question classifications that must be treated differently. In particular, it includes information about whether to always or never consider other question classifications when a particular classification is identified as the primary classification.
For example, you generally want to consider other location classifications when a question matches the LOC:other classification. Similarly, for classifications that match descriptive questions, you can specify that other classifications are never included, because classifications that match entities are less relevant but might otherwise score higher in the results.
The primary classification is determined by a probability threshold, which is 0.85 by default.
If you move or rename the classifier behavior JSON file, modify the ClassifierBehaviorFile configuration parameter to specify the new name and location.
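To illustrate how a primary classification and the always/never rules might interact, the following Python sketch shows one possible reading of the behavior described above. It is not the Answer Server implementation and does not reflect the real format of the behavior file. The function name, the rule values, and the fallback when no rule exists are all assumptions.

```python
DEFAULT_THRESHOLD = 0.85  # default primary-classification threshold from this topic


def select_classifications(scores, behavior, threshold=DEFAULT_THRESHOLD):
    """Return the labels to consider, highest probability first.

    scores:   dict mapping label -> classifier probability
    behavior: dict mapping label -> "always" or "never" (hypothetical rules)
    """
    if not scores:
        return []
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[0]
    if scores[top] < threshold:
        # Assumption: with no primary classification, consider every label.
        return ranked
    if behavior.get(top) == "always":
        # The primary classification always brings in the other labels.
        return ranked
    # "never", or no rule at all (assumed fallback): keep only the primary.
    return [top]


# Hypothetical rules and scores, for illustration only.
rules = {"LOC:other": "always", "DESC:def": "never"}
print(select_classifications({"LOC:other": 0.90, "LOC:city": 0.07, "HUM:ind": 0.03}, rules))
print(select_classifications({"DESC:def": 0.92, "HUM:ind": 0.08}, rules))
```

In this reading, a descriptive question keeps only its primary classification, so entity classifications cannot outrank it in the results.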