Text Classification Tutorial

In this tutorial, you will experiment with different machine learning methods for text classification. Here’s an overview of what you’ll do:

  • Extract feature vectors from a corpus of labeled documents.
  • Implement feature selection based on Chi-square.
  • Implement a simple text classifier: either Naïve Bayes or a perceptron.
  • Run experiments with the two models, varying the number of features and the training set size. Report and analyze the results.

Training/Testing data

We will use a subset of the 20 Newsgroups data set, a collection of approximately 20,000 newsgroup documents partitioned (nearly) evenly across 20 different newsgroups. The collection has become a popular data set for experiments in text applications of machine learning, such as text classification and text clustering. We will use data from only 10 of the newsgroups, each corresponding to a different topic.

The documents are already divided into a training set and a testing set. In each set you will see ten directories, '0' to '9'. The directory name is the label of the text files within it. You can find the data here

Documents As Feature Vectors

Each document can be represented as rows of (document, word, count), as in the following table:

DOC  WORD   COUNT
123  base   2
123  bat    5
123  ball   1
123  score  2
234  puck   2
234  score  1
234  goal   3

or as the following set of vectors:

  • [base bat ball score puck goal]
  • f(DOC123) = [ 2 5 1 2 0 0 ], class = baseball
  • f(DOC234) = [ 0 0 0 1 2 3 ], class = hockey
  • ...
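
The conversion from the table form to the vector form can be sketched as follows. This is a minimal illustration in Python, not part of Clairlib; the function name to_vectors is hypothetical, and the rows and the feature order are copied from the example for document 123 above.

```python
from collections import defaultdict

# (doc id, word, count) rows for document 123, as in the table above
rows = [
    ("123", "base", 2),
    ("123", "bat", 5),
    ("123", "ball", 1),
    ("123", "score", 2),
]

# Fixed feature order, copied from the vector listing above
vocabulary = ["base", "bat", "ball", "score", "puck", "goal"]


def to_vectors(rows, vocabulary):
    """Return {doc_id: [count of each vocabulary word, in vocabulary order]}."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = defaultdict(lambda: [0] * len(vocabulary))
    for doc_id, word, count in rows:
        if word in index:  # words outside the vocabulary are ignored
            vectors[doc_id][index[word]] += count
    return dict(vectors)


vectors = to_vectors(rows, vocabulary)
print(vectors["123"])  # [2, 5, 1, 2, 0, 0]
```

Words that a document does not contain keep a count of 0, which is why most entries of a real document vector are zero and a sparse format is used for storage.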

Feature extraction

By default, you should use all words (after proper tokenization) that appear in the training documents as features for the classifier. Each word (feature) is weighted by its frequency in the document. Feature vectors should be extracted for both the training and testing sets and stored in SVM light file format. Each line of such a file represents one example and has the following form:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> 
<target> .=. <integer> 
<feature> .=. <integer> 
<value> .=. <float>
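
To make the format concrete, here is a small Python sketch (independent of Clairlib) that writes and reads one such line. The function names are hypothetical. The writer sorts the pairs by feature id because SVM light itself expects feature numbers in increasing order; whether other readers of this format require that is not assumed here.

```python
def svmlight_line(target, features):
    """Format one example; `features` maps integer feature ids to values."""
    pairs = " ".join(f"{fid}:{value}" for fid, value in sorted(features.items()))
    return f"{target} {pairs}"


def parse_svmlight_line(line):
    """Parse '<target> <feature>:<value> ...' into (target, {feature: value})."""
    target, *pairs = line.split()
    features = {}
    for pair in pairs:
        fid, value = pair.split(":")
        features[int(fid)] = float(value)
    return int(target), features


line = svmlight_line(3, {7: 2, 1: 5, 4: 1})
print(line)  # 3 1:5 4:1 7:2
print(parse_svmlight_line(line))  # (3, {1: 5.0, 4: 1.0, 7: 2.0})
```

Only the non-zero features of a document are listed, so a document with a few hundred distinct words needs a few hundred pairs even if the vocabulary has tens of thousands of entries.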

The following code (extract_features.pl) extracts the word frequencies and outputs the feature vectors for a set of documents. The script assumes that all the documents of a category are located in one directory named after that category.

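use strict;
use warnings;
use Clair::Document;

# Usage: extract_features.pl <dataset_dir> <output_file>
my $dataset_dir = shift;
my $output_file = shift;

# Each subdirectory of the dataset directory is one category (label).
my @categories = `ls $dataset_dir`;
my %features = ();   # maps each term to its integer feature id
my $i = 1;           # next unused feature id
open(OUT, ">", $output_file) or die "Cannot open $output_file: $!";
foreach my $cat (@categories) {
  chomp($cat);
  my @files = `ls $dataset_dir/$cat`;
  foreach my $file (@files)
  {
      chomp($file);
      # Every output line starts with the label of the document.
      print OUT $cat, " ";
      my $text = `cat $dataset_dir/$cat/$file`;
      $text =~ s/|//g;
      my $doc = Clair::Document->new(string => $text);
      # Term frequencies of the stemmed words of the document.
      my %tf = $doc->tf(type => "stem");
      while ( my ($key, $value) = each(%tf) ) {
         # Feature ids are assigned in order of first appearance within this run.
         my $k;
         if(defined $features{$key}){
             $k=$features{$key};
         }else{
             $k=$i;
             $features{$key}=$i;
             $i++;
         }
         print OUT "$k:$value ";
      }
      print OUT "\n";
  }
}
close(OUT);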

Run this script to extract the features of the training set and the testing set.

extract_features.pl train features.train
extract_features.pl test features.test

Here "train" and "test" are the names of the directories of the training set and the testing set respectively, and "features.train" and "features.test" are the names of the output files.

Text Classifier

Training a model

In this tutorial we use the documents of the training set to train a model that will then be used to classify the documents of the testing set. The following script (learn.pl) uses the Clair::Learn module to train the model. As written, it calls naiveBayes_learn and therefore trains a Naïve Bayes model.

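use strict;
use warnings;
use Clair::Learn;

# Usage: learn.pl <train_features_file> <model_file>
my $DEBUG = 0;
my $train_features_file = shift;
my $model_file = shift;
# Train on the feature file; the model is stored under the name in $model_file.
my $lea = Clair::Learn->new(DEBUG => $DEBUG);
$lea->naiveBayes_learn($train_features_file, $model_file);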


The script can be run using:

learn.pl features.train model

Here "features.train" is the input feature file of the training set and "model" is the name of the resulting model.
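
The script above calls naiveBayes_learn, so it may help to see what a Naïve Bayes learner does with count features. The following Python sketch is a generic multinomial Naïve Bayes with add-one smoothing. It is an illustration under stated assumptions and not Clairlib's implementation: the function names, the smoothing parameter alpha, and the in-memory model layout are all hypothetical.

```python
import math
from collections import defaultdict


def train_naive_bayes(examples, alpha=1.0):
    """examples: list of (label, {feature_id: count}). Returns a model dict."""
    doc_counts = defaultdict(int)                           # documents per class
    feat_counts = defaultdict(lambda: defaultdict(float))   # per-class feature totals
    vocabulary = set()
    for label, features in examples:
        doc_counts[label] += 1
        for fid, count in features.items():
            feat_counts[label][fid] += count
            vocabulary.add(fid)
    total_docs = sum(doc_counts.values())
    model = {}
    for label in doc_counts:
        # Smoothed denominator: all counts in the class plus alpha per feature
        total = sum(feat_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(doc_counts[label] / total_docs),
            "log_likelihood": {
                fid: math.log((feat_counts[label][fid] + alpha) / total)
                for fid in vocabulary
            },
        }
    return model


def classify(model, features):
    """Return the label with the highest log posterior; unseen features are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            count * params["log_likelihood"][fid]
            for fid, count in features.items()
            if fid in params["log_likelihood"]
        )
    return max(model, key=score)


# Toy training set: label 0 = baseball, label 1 = hockey
training = [
    (0, {1: 2, 2: 5, 3: 1, 4: 2}),
    (1, {4: 1, 5: 2, 6: 3}),
]
model = train_naive_bayes(training)
print(classify(model, {2: 1}))        # 0
print(classify(model, {5: 1, 6: 1}))  # 1
```

Working in log space avoids numerical underflow when a document contains many words, and the smoothing keeps a word that never occurred in a class from forcing that class's probability to zero.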

Use the trained model to classify the documents of the testing set

The following script (classify.pl) uses the Clair::Classify module to test the trained model on the testing set.

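use strict;
use warnings;
use Clair::Classify;

# Usage: classify.pl <test_features_file> <model_file> <output_file>
my $DEBUG = 0;
my $test_features_file = shift;
my $model_file = shift;
my $output_file = shift;
# Classify the test documents with the trained model and write the results.
my $classify = Clair::Classify->new(DEBUG => $DEBUG);
$classify->naiveBayes_classify($test_features_file, $model_file, $output_file);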

The script can be run using:

classify.pl features.test model results.dat

Here "features.test" is the input feature file of the testing set and "model" is the trained model generated in the previous (training) step. The file results.dat will contain the results of the classification, with one tab-separated line per document in the following format:

docid   score   computed_class   correct_class   y|n

The last column outputs y if the computed_class matches the correct_class and n otherwise.
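
Given that format, overall accuracy and the most common confusions can be computed with a few lines of code. The Python sketch below assumes exactly the five tab-separated columns listed above; the function name summarize and the sample lines are made up for illustration and are not real output.

```python
from collections import Counter


def summarize(lines):
    """Return (accuracy, Counter of (correct_class, computed_class) errors)."""
    correct = total = 0
    errors = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        _docid, _score, computed, expected, match = fields
        total += 1
        if match == "y":
            correct += 1
        else:
            errors[(expected, computed)] += 1
    accuracy = correct / total if total else 0.0
    return accuracy, errors


# Hypothetical sample lines in the results.dat layout
sample = [
    "doc1\t0.91\t3\t3\ty\n",
    "doc2\t0.55\t3\t7\tn\n",
    "doc3\t0.87\t0\t0\ty\n",
    "doc4\t0.73\t7\t7\ty\n",
]
accuracy, errors = summarize(sample)
print(accuracy)      # 0.75
print(dict(errors))  # {('7', '3'): 1}
```

To run it on real output, pass an open file instead of the sample list, for example summarize(open("results.dat")).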
