Perceptron Learning and Classification
In this tutorial, you'll learn how to use Clairlib to extract feature vectors from a set of documents, how to train a model with the perceptron learning algorithm given the feature vectors of a training set of documents, and how to classify a set of testing documents given their feature vectors using the perceptron classification algorithm.
For this tutorial, we will use two sets of XML-formatted documents about "sport": one set for training a model and the other for testing the trained model. You can download the training set from here and the testing set from here.
Extract the two datasets into two directories, "training" and "testing".
tar -xvf training.tar.gz
tar -xvf testing.tar.gz
Extracting the feature vectors
We start by extracting the feature vectors for both the training and the testing sets. To do this, we will use the "extract_features.pl" utility that comes with Clairlib and outputs the feature vectors for a set of documents in svm_light format. For information on the usage of this utility, run
To extract the features for the training set, run
extract_features.pl --directory training --output features.train --parser sports --mode train --select 100 --verbose
This computes the feature vectors for the training documents located in the "training" directory, uses the chi-square method to select the 100 most discriminative features, and finally writes the vectors to the "features.train" file in svm_light format.
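To give an intuition for chi-square feature selection, here is a minimal Python sketch. It is not Clairlib's implementation: the function names `chi_square_scores` and `select_top` are made up for illustration, and it assumes two classes labeled +1 and -1 and treats each document as a set of terms. It uses the standard 2x2 contingency form of the chi-square statistic.

```python
from collections import Counter


def chi_square_scores(docs):
    """Score each term by its chi-square association with the class label.

    docs: list of (terms, label) pairs, where terms is an iterable of
    strings and label is +1 or -1 (an assumption of this sketch).
    """
    n = len(docs)
    n_pos = sum(1 for _, label in docs if label == 1)
    n_neg = n - n_pos
    pos_df = Counter()  # number of positive documents containing each term
    neg_df = Counter()  # number of negative documents containing each term
    for terms, label in docs:
        (pos_df if label == 1 else neg_df).update(set(terms))

    scores = {}
    for term in set(pos_df) | set(neg_df):
        a = pos_df[term]   # positive class, term present
        b = neg_df[term]   # negative class, term present
        c = n_pos - a      # positive class, term absent
        d = n_neg - b      # negative class, term absent
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores


def select_top(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Terms that occur almost exclusively in one class get high scores, while terms spread evenly across both classes score near zero and are dropped.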
To extract the feature vectors for the testing dataset, run the same command with the proper arguments
extract_features.pl --directory testing --output features.test --parser sports --mode test --select 100 --verbose
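Both output files are in svm_light format, where each line holds a class label followed by `index:value` pairs and an optional `#` comment. If you want to inspect the files yourself, a small reader could look like the Python sketch below. The function name `parse_svmlight_line` is hypothetical, and the sketch assumes plain classification lines (no `qid:` fields).

```python
def parse_svmlight_line(line):
    """Parse one svm_light line into (label, {feature_index: value}).

    Returns None for blank or comment-only lines.
    """
    body = line.split("#", 1)[0].strip()  # drop an optional trailing comment
    if not body:
        return None
    fields = body.split()
    label = float(fields[0])
    features = {}
    for pair in fields[1:]:
        index, value = pair.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def read_svmlight(path):
    """Read a whole svm_light file into a list of (label, features) pairs."""
    examples = []
    with open(path) as handle:
        for line in handle:
            parsed = parse_svmlight_line(line)
            if parsed is not None:
                examples.append(parsed)
    return examples
```

For example, `read_svmlight("features.train")` would return one `(label, features)` pair per training document, assuming the file follows the layout described above.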
Training a model using the perceptron algorithm
Now, we can pass the training set features file to the perceptron algorithm to train a model. To do this, we will use the "learn.pl" utility, which takes a training features file and trains a model. For information on the usage of this utility, run
To train a model on our training set, run
learn.pl --train_features features.train --model model_file --eta 0.5 --verbose
This command will write the trained model to the "model_file" file.
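To make the training step less of a black box, the following Python sketch shows the classic perceptron update rule. It is an illustration only, not the code inside "learn.pl": it assumes labels of +1 and -1, sparse feature dictionaries, and that `--eta` corresponds to the learning rate `eta` used here. The `epochs` parameter and the function names are hypothetical.

```python
def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(i, 0.0) * v for i, v in features.items())


def train_perceptron(examples, eta=0.5, epochs=10):
    """Train a perceptron on (label, features) pairs with labels in {+1, -1}.

    On every misclassified example the weights move toward the correct
    side by eta * label * feature_value.
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for label, features in examples:
            if label * (dot(weights, features) + bias) <= 0:
                for i, v in features.items():
                    weights[i] = weights.get(i, 0.0) + eta * label * v
                bias += eta * label
                mistakes += 1
        if mistakes == 0:  # every example is classified correctly; stop early
            break
    return weights, bias


def predict(weights, bias, features):
    """Return (score, predicted_class) for one feature vector."""
    score = dot(weights, features) + bias
    return score, (1 if score > 0 else -1)
```

A larger `eta` makes each correction bigger, which can speed up learning but may also make the weights oscillate on data that is not linearly separable.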
Classifying the testing set using a trained model
Now, we can use the trained model to classify the testing set. To do this, we will use the "classify.pl" utility which takes the testing features file and the model file as arguments and outputs the classification result in the following format
docid score computed_class correct_class y|n
For more information on the usage of this utility, run
To classify the testing set with this utility, run
classify.pl --test_features features.test --model model_file --output results --verbose
The classification results will be written to the "results" file, and some statistics will be printed to the screen.
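If you want to recompute the accuracy from the output file yourself, a short Python sketch like the one below may help. It assumes each line of the results file has exactly the five whitespace-separated fields listed above (docid, score, computed_class, correct_class, y|n) and skips anything else. The function name `summarize_results` is hypothetical.

```python
def summarize_results(lines):
    """Count total and correctly classified documents in result lines.

    Each line is expected to look like:
        docid score computed_class correct_class y|n
    Lines that do not have exactly five fields are ignored.
    """
    total = 0
    correct = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue
        flag = fields[4]  # "y" when computed_class matches correct_class
        total += 1
        if flag == "y":
            correct += 1
    accuracy = correct / total if total else 0.0
    return total, correct, accuracy
```

To apply it to the file written by the command above, open the file and pass the handle directly, for example `with open("results") as handle: print(summarize_results(handle))`.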