The following sections describe how to create and train a classifier, and query the classifier with new documents.
For more information about the classification actions, refer to the IDOL Server Reference.
Before you create a classifier, you must choose the fields in your documents that you want to use to classify documents. These are the feature fields for the classifier.
Feature fields generally contain short pieces of information, such as a name or a very brief description. A good choice of feature field is similar to a good choice of ParametricType field. For example, if you want to create a food classifier, you might use a field that stores ingredients, or a meal name, rather than a field that contains a recipe procedure or a detailed description of a type of food.
The feature fields must contain information that describes features of the different classes that you want to create for your classifier. For example, to classify meals as vegetarian or meat-based, you must find feature fields that describe features of vegetarian or meat-based meals.
The exact choice of feature field also depends on the contents of your documents.
For example, the following IDX document describes part of a recipe for soup:
#DREREFERENCE Food/Carrot and Coriander Soup
#DRETITLE Carrot and Coriander Soup
#DRESECTION 0
#DREFIELD Ingredient="carrots"
#DREFIELD Ingredient="onion"
#DREFIELD Ingredient="potato"
#DREFIELD Herbs="coriander"
#DREFIELD Seasoning="vegetable stock"
#DREFIELD Meal="soup"
#DREFIELD Equipment="food processor"
#DREFIELD PreparationTime="20 minutes"
#DREFIELD CookingTime="1 hour"
#DREFIELD Description="This easy recipe makes a tasty carrot and coriander soup"
#DRECONTENT
Example soup recipe
#DREENDDOC
Depending on the classes that you want to create, suitable feature fields in this document might be the Ingredient field, the Meal and Ingredient fields, or the PreparationTime and CookingTime fields.
You can choose more than one feature field for a classifier. The classifier does not distinguish between data from different feature fields. It extracts the content from all the available feature fields in a document, and uses all the content to train the classifier (or classify a document).
For example, if your document had the fields:
#DREFIELD Ingredient1="carrots"
#DREFIELD Ingredient2="onion"
#DREFIELD Ingredient3="potato"
You can set Ingredient1, Ingredient2, and Ingredient3 as feature fields. If you use this document for classification, it gives the same results as if you used a document with the following fields:
#DREFIELD Ingredient1="onion"
#DREFIELD Ingredient2="potato"
#DREFIELD Ingredient3="carrots"
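The following Python sketch illustrates why the two documents above are equivalent. It is not IDOL Server's implementation: the helper names parse_idx_fields and pooled_features are hypothetical, and the sketch only assumes that feature values are pooled without regard to the field they came from.

```python
from collections import Counter


def parse_idx_fields(idx_text):
    """Return (name, value) pairs for each #DREFIELD line in an IDX fragment."""
    pairs = []
    for line in idx_text.splitlines():
        line = line.strip()
        if not line.startswith("#DREFIELD "):
            continue
        name, _, value = line[len("#DREFIELD "):].partition("=")
        pairs.append((name.strip(), value.strip().strip('"')))
    return pairs


def pooled_features(idx_text, feature_fields):
    """Pool the values of all feature fields, discarding the field names."""
    return Counter(
        value
        for name, value in parse_idx_fields(idx_text)
        if name in feature_fields
    )


doc_a = """#DREFIELD Ingredient1="carrots"
#DREFIELD Ingredient2="onion"
#DREFIELD Ingredient3="potato"
"""

doc_b = """#DREFIELD Ingredient1="onion"
#DREFIELD Ingredient2="potato"
#DREFIELD Ingredient3="carrots"
"""

fields = {"Ingredient1", "Ingredient2", "Ingredient3"}

# Both documents yield the same pooled set of feature values.
same = pooled_features(doc_a, fields) == pooled_features(doc_b, fields)
```

Because the field names are dropped before comparison, moving a value from one feature field to another does not change the pooled result.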
You create a classifier with a unique name and a set of feature fields.
To create a classifier
Send a ClassifierCreate action to IDOL Server, with the following parameters:
ClassifierName    set to the name of the new classifier. This name must be unique in IDOL Server.
ClassifierType    set to RandomForest.
FeatureFields     set to a comma-separated list of the feature fields that you want to use for the classifier.
For example:
action=ClassifierCreate&ClassifierName=food&FeatureFields=Ingredient,Herbs,Seasoning
This action creates a food classifier, which uses the Ingredient, Herbs, and Seasoning fields to classify documents.
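If you generate actions from a script, you can assemble the action string rather than typing it. The following Python sketch rebuilds the example action above. The host and port in IDOL_BASE_URL and the helper build_action_url are assumptions for illustration only; substitute the address of your own IDOL Server.

```python
from urllib.parse import urlencode

# Hypothetical address; replace with your IDOL Server host and ACI port.
IDOL_BASE_URL = "http://localhost:9000"


def build_action_url(action, **params):
    """Build an action URL, leaving commas unescaped for list parameters."""
    query = urlencode({"action": action, **params}, safe=",")
    return f"{IDOL_BASE_URL}/{query}"


feature_fields = ["Ingredient", "Herbs", "Seasoning"]

create_url = build_action_url(
    "ClassifierCreate",
    ClassifierName="food",
    FeatureFields=",".join(feature_fields),
)
```

Additional parameters, such as ClassifierType, can be passed to the helper in the same way as keyword arguments.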
After you create the classifier, you create and assign training to the classes. You can either create the classes and assign training in a single action, or you can create the classes and train them later.
The documents that you use to train the class must exist in the IDOL Server data index. You provide training in the form of a state token, which you create by using the Query action with the StoreState parameter set to True. See Choose Training Documents for Classes.
To create a class
Send a ClassifierAddClass action to IDOL Server, with the following parameters:
ClassifierName    set to the name of the classifier.
ClassName         set to the name of the new class.
StateID           set to a state token that lists the documents that you want to use to train the class.
For example:
action=ClassifierAddClass&ClassifierName=food&ClassName=vegetarian&StateID=B8UGIK95FKJG-23
This action creates a vegetarian class in the food classifier. It assigns the documents from the state token B8UGIK95FKJG-23 as training for the new class.
If you do not train the class when you create it, you can add training by using the ClassifierSetClassTraining action. You can also use this action to retrain a class. For more information, see Retrain a Class.
You must run the ClassifierAddClass action for each class that you want to create in the classifier.
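Because each class needs its own ClassifierAddClass action, it can be convenient to generate the actions from a mapping of class names to state tokens. The following Python sketch assumes such a mapping. The meat class and its token are placeholders, not real values; only the vegetarian entry comes from the example above.

```python
from urllib.parse import urlencode

# Class name -> state token that lists the training documents for the class.
training_tokens = {
    "vegetarian": "B8UGIK95FKJG-23",       # token from the example above
    "meat": "STATE_TOKEN_FOR_MEAT_CLASS",  # placeholder, not a real token
}

# One ClassifierAddClass action per class in the food classifier.
add_class_actions = [
    urlencode(
        {
            "action": "ClassifierAddClass",
            "ClassifierName": "food",
            "ClassName": class_name,
            "StateID": state_token,
        }
    )
    for class_name, state_token in training_tokens.items()
]
```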
When you create a classifier, you must train each of the classes with content that represents the classes that you want to define. The content must exist in your IDOL Server data index, and the content must contain the feature fields that you have defined for the classifier.
You provide training to the classes as a state token. You create state tokens by sending the Query action with the StoreState parameter set to True. Therefore, to train a class, you must have a single query that returns the documents that define that class.
For some classifications, you might be able to perform a complex query that returns enough documents to train your classifier. However, the best way to find training is usually to manually categorize a set of documents, and add a field that labels the document with its class. You can then use a simple FieldText query to find all documents with a particular label.
For example, if you label a set of documents with a MealType field, with a value of savory or dessert, you can use the following query to find and save the results to use as training for the savory class:
action=Query&FieldText=MATCH{savory}:MealType&MaxResults=1000&StoreState=True
You can use the resulting state token that this query returns to train the class. You can also create similar queries to train your other classes.
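The similar queries for the other classes differ only in the label value, so they can be generated. The following Python sketch builds one query per label; the helper training_query is hypothetical. It leaves the braces and colon of the FieldText expression unescaped only to mirror the example above. Omitting the safe argument percent-encodes them instead.

```python
from urllib.parse import urlencode


def training_query(label_field, label, max_results=1000):
    """Build a Query action that stores the labeled documents as a state token."""
    return urlencode(
        {
            "action": "Query",
            "FieldText": f"MATCH{{{label}}}:{label_field}",
            "MaxResults": max_results,
            "StoreState": "True",
        },
        safe="{}:",  # keep the FieldText operator readable, as in the example
    )


# One query per class label in the MealType field.
queries = {label: training_query("MealType", label) for label in ("savory", "dessert")}
```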
After you have trained the classifier, you can classify any new documents, and automatically add the label field to those documents.
Note: To get the best results out of your classifiers, use as many training documents as possible. HPE recommends that you use a minimum of 200 to 300 training documents for each class.
You must train the classifier before you can use it to classify documents. During this stage, IDOL Server retrieves all the training documents from the index, and extracts the feature fields. It uses the content to train each class in the classifier.
For IDOL Server to successfully train the classifier, the classifier must have at least two classes, and each class must have training assigned.
Note: When IDOL Server trains the classifier, it ignores any very rare features.
To train a classifier
Send a ClassifierTrain action to IDOL Server, with the ClassifierName parameter set to the name of the classifier.
For example:
action=ClassifierTrain&ClassifierName=food
This action trains the food classifier.
Note: The action returns an error if IDOL Server could not extract any features from the training documents (for example, because none of the training documents contained the feature fields for the classifier).
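If a script drives the training, it can check the two preconditions described above before it sends ClassifierTrain. The following Python sketch assumes that the script keeps its own record of each class and its state token. The function can_train and this bookkeeping are hypothetical, and are not part of IDOL Server.

```python
def can_train(classes):
    """Return True if there are at least two classes and all have training.

    `classes` maps each class name to its state token, or to None when no
    training has been assigned to the class yet.
    """
    if len(classes) < 2:
        return False
    return all(token for token in classes.values())


food_classes = {
    "vegetarian": "B8UGIK95FKJG-23",       # token from the earlier example
    "meat": "STATE_TOKEN_FOR_MEAT_CLASS",  # placeholder, not a real token
}

# Only build the training action when the preconditions hold.
train_action = (
    "action=ClassifierTrain&ClassifierName=food" if can_train(food_classes) else None
)
```

This check covers only the class count and training assignment. It cannot detect training documents that lack the feature fields, which IDOL Server reports as an error when you send the action.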
You can use a trained classifier to classify documents, by using the ClassifierQuery action.
The document can either be a document that exists in the index, or a document that does not exist in the index.
In both cases, IDOL Server extracts the classifier feature fields from the query document, and compares the values in these feature fields against the trained classes in the classifier. The action returns the class that the document matches most closely.
To classify a document that exists in the index
Send the ClassifierQuery action with the following parameters:
ClassifierName    set to the name of the classifier to use to classify the document.
DocRef            set to the IDOL Server reference of the document to classify.
For example:
action=ClassifierQuery&ClassifierName=food&DocRef=http://www.example.com/documents/carrots
To classify a document that does not exist in the index
Send the ClassifierQuery action with the following parameters:
ClassifierName    set to the name of the classifier to use to classify the document.
QueryText         set to the percent-encoded IDX or XML document (IDOL Server detects the correct format automatically).
For example:
action=ClassifierQuery&ClassifierName=food&QueryText=%23DREREFERENCE%20Food%2FLeek%20and%20Potato%20Pie%0D%0A%23DRETITLE%20Leek%20and%20Potato%20Pie%0D%0A%23DRESECTION%200%0D%0A%23DREFIELD%20Ingredient%3D%22leeks%22%0D%0A%23DREFIELD%20Ingredient%3D%22potatoes%22%0D%0A%23DREFIELD%20Ingredient%3D%22cheese%22%0D%0A%23DREFIELD%20Ingredient%3D%22pastry%22%0D%0A%23DREFIELD%20Ingredient%3D%22butter%22%0D%0A%23DREFIELD%20Ingredient%3D%22egg%22%0D%0A%23DREFIELD%20Herbs%3D%22rosemary%22%0D%0A%23DREFIELD%20Herbs%3D%22thyme%22%0D%0A%23DREFIELD%20Meal%3D%22pie%22%0D%0A%23DREFIELD%20Equipment%3D%22pie%20dish%22%0D%0A%23DREFIELD%20PreparationTime%3D%2210%20minutes%22%0D%0A%23DREFIELD%20CookingTime%3D%221%20hour%22%0D%0A%23DREFIELD%20Description%3D%22This%20easy%20recipe%20makes%20a%20tasty%20leek%20and%20potato%20pie%22%0D%0A%23DRECONTENT%0D%0APie%20recipe%0D%0A%23DREENDDOC
This action classifies the following document:
#DREREFERENCE Food/Leek and Potato Pie
#DRETITLE Leek and Potato Pie
#DRESECTION 0
#DREFIELD Ingredient="leeks"
#DREFIELD Ingredient="potatoes"
#DREFIELD Ingredient="cheese"
#DREFIELD Ingredient="pastry"
#DREFIELD Ingredient="butter"
#DREFIELD Ingredient="egg"
#DREFIELD Herbs="rosemary"
#DREFIELD Herbs="thyme"
#DREFIELD Meal="pie"
#DREFIELD Equipment="pie dish"
#DREFIELD PreparationTime="10 minutes"
#DREFIELD CookingTime="1 hour"
#DREFIELD Description="This easy recipe makes a tasty leek and potato pie"
#DRECONTENT
Pie recipe
#DREENDDOC
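Percent-encoding a document by hand is error-prone, so you might prefer to generate the QueryText value. The following Python sketch encodes an abbreviated version of the pie document with the standard urllib.parse.quote function. It assumes that the IDX lines are joined with carriage return and line feed characters, as the %0D%0A sequences in the example action suggest.

```python
from urllib.parse import quote, unquote

# Abbreviated version of the pie document, one IDX line per list entry.
idx_lines = [
    "#DREREFERENCE Food/Leek and Potato Pie",
    "#DRETITLE Leek and Potato Pie",
    "#DRESECTION 0",
    '#DREFIELD Ingredient="leeks"',
    '#DREFIELD Ingredient="potatoes"',
    '#DREFIELD Herbs="thyme"',
    "#DRECONTENT",
    "Pie recipe",
    "#DREENDDOC",
]

# Join with CRLF so that each line break encodes as %0D%0A.
idx_document = "\r\n".join(idx_lines)

# safe="" also encodes "/" so the whole document is a single parameter value.
query_text = quote(idx_document, safe="")

classify_action = (
    "action=ClassifierQuery&ClassifierName=food&QueryText=" + query_text
)
```

Decoding query_text with urllib.parse.unquote returns the original document, which is a quick way to confirm that nothing was lost in the encoding.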