Perceptron Learning and Classification


In this tutorial, you'll learn how to use Clairlib to extract feature vectors from a set of documents, how to train a model with the perceptron learning algorithm from the feature vectors of a training set, and how to classify a set of testing documents from their feature vectors using the perceptron classification algorithm.

Dataset

For the purpose of this tutorial, we will use two sets of XML-formatted documents about "sport". One set is for training a model and the other is for testing the trained model. You can download the training set from here and the testing set from here.

Extract the two datasets into two directories, "training" and "testing":

tar -xvf training.tar.gz
tar -xvf testing.tar.gz

Extracting the feature vectors

We start by extracting the feature vectors for both the training and the testing sets. To do this, we will use the "extract_features.pl" utility that comes with Clairlib and outputs the feature vectors for a set of documents in svm_light format. For information on the usage of this utility, run

extract_features.pl --help

To extract the features for the training set, run

extract_features.pl --directory training --output features.train --parser sports --mode train --select 100 --verbose

This will compute the feature vectors for the training documents located in the "training" directory, use the chi-square method to select the 100 most discriminative features, and finally write the vectors to the "features.train" file in svm_light format.
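To make the chi-square selection step more concrete, here is a minimal Python sketch of scoring terms against a class with the standard 2x2 chi-square statistic and keeping the top k. It is an illustration only and not Clairlib's implementation; the function names `chi_square` and `select_top_features` and the layout of the `counts` dictionary are assumptions made for this example.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_top_features(counts, k):
    """Return the k terms with the highest chi-square score.

    counts maps each term to its (n11, n10, n01, n00) tuple
    (a hypothetical layout chosen for this sketch).
    """
    ranked = sorted(counts, key=lambda term: chi_square(*counts[term]), reverse=True)
    return ranked[:k]
```

A term that appears in every document of one class and in none of the other gets a high score, while a term spread evenly across both classes scores zero and is dropped first.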

To extract the feature vectors for the testing dataset, run the same command with the testing directory, a different output file, and the "test" mode

extract_features.pl --directory testing --output features.test --parser sports --mode test --select 100 --verbose
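If you want to inspect the generated feature files, the following Python sketch reads one line of the general svm_light format, which is a label followed by `index:value` pairs and an optional `#` comment. The helper name `parse_svmlight_line` is an assumption for this example, and the exact layout written by extract_features.pl may differ in details, so treat this as a starting point rather than a reference parser.

```python
def parse_svmlight_line(line):
    """Parse one svm_light-style line into (label, {index: value}).

    Returns None for blank or comment-only lines.
    """
    # Everything after '#' is a free-text comment in svm_light files.
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    tokens = body.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        if index == "qid":
            # Ranking files may carry a query id; it is not a feature.
            continue
        features[int(index)] = float(value)
    return label, features


def read_svmlight(path):
    """Read a whole feature file, skipping blank and comment lines."""
    with open(path) as handle:
        parsed = (parse_svmlight_line(line) for line in handle)
        return [example for example in parsed if example is not None]
```

For instance, `read_svmlight("features.train")` would return a list of `(label, features)` pairs, assuming the file follows the layout described above.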

Training a model using the perceptron algorithm

Now, we can pass the training set features file to the perceptron algorithm to train a model. To do this, we will use the "learn.pl" utility that takes a training features file and trains a model. For information on the usage of this utility run

learn.pl --help

To use our training dataset for training a model, run

learn.pl --train_features features.train --model model_file --eta 0.5 --verbose

This command will write the trained model to the "model_file" file.
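For readers unfamiliar with the algorithm, the sketch below shows the textbook perceptron update rule in Python. It is not the code behind learn.pl. It assumes that labels are +1 or -1, that `--eta` denotes the learning rate (as the symbol usually does), and it introduces a hypothetical `epochs` parameter to bound the number of passes over the data.

```python
def score_example(weights, bias, features):
    """Linear score: bias plus the dot product of weights and a sparse vector."""
    return bias + sum(weights.get(i, 0.0) * v for i, v in features.items())


def predict(weights, bias, features):
    """Return (score, predicted_class) with classes +1 and -1."""
    score = score_example(weights, bias, features)
    return score, (1 if score > 0 else -1)


def train_perceptron(examples, eta=0.5, epochs=10):
    """Train on (label, {index: value}) pairs and return (weights, bias).

    eta is the learning rate; epochs is a hypothetical cap on full passes.
    """
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for label, features in examples:
            _, predicted = predict(weights, bias, features)
            if predicted != label:
                # Move the decision boundary toward the misclassified example.
                for i, v in features.items():
                    weights[i] = weights.get(i, 0.0) + eta * label * v
                bias += eta * label
                mistakes += 1
        if mistakes == 0:
            # Every training example is classified correctly; stop early.
            break
    return weights, bias
```

The weights change only when an example is misclassified, and a larger eta makes each correction more aggressive. On linearly separable data the loop stops as soon as a full pass produces no mistakes.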

Classifying the testing set using a trained model

Now, we can use the trained model to classify the testing set. To do this, we will use the "classify.pl" utility which takes the testing features file and the model file as arguments and outputs the classification result in the following format

docid   score   computed_class   correct_class   y|n

For more information on the usage of this utility, run

classify.pl --help

To classify the testing set with this utility, run

classify.pl --test_features features.test --model model_file --output results --verbose

The result of the classification will be written to the "results" file, and some statistics will be printed on the screen.
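If you want to post-process the output file yourself, the Python sketch below computes the accuracy from lines in the five-column format shown earlier. It assumes that the columns are separated by whitespace and that "y" in the last column marks a correctly classified document; the function name `summarize_results` is made up for this example.

```python
def summarize_results(lines):
    """Return (total, correct, accuracy) for classification result lines.

    Each line is expected to hold: docid score computed_class correct_class y|n
    """
    total = 0
    correct = 0
    for line in lines:
        fields = line.split()
        # Skip blank lines, headers, and anything else that does not match.
        if len(fields) != 5 or fields[4] not in ("y", "n"):
            continue
        total += 1
        if fields[4] == "y":
            correct += 1
    accuracy = correct / total if total else 0.0
    return total, correct, accuracy
```

With the file produced above, this could be called as `summarize_results(open("results"))`, since iterating over an open file yields its lines.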
