Single Document Summarization

In this tutorial you will learn how to use Clairlib to summarize a text document or HTML page. The basic idea behind text summarization is to identify the most important pieces of information in the document, omitting irrelevant information and minimizing details, and to assemble them into a compact, coherent report. One way to determine which pieces are most important is to split the document into pieces (e.g. sentences) and then compute some textual features for each sentence, such as length, centroid, position, and similarity to the first piece. The feature values of each sentence are then combined in some way to score it. Finally, the sentences with the highest scores are returned as the summary. This method of automatic summarization is called Sentence Extraction; it is the one implemented in Clairlib and discussed in this tutorial.

For the purpose of this tutorial, we will use as a sample document an HTML-formatted news article (gulf.html) about a Gulf Air Airbus that crashed off the coast of Bahrain in the Persian Gulf on August 23, 2000.

Document Summarization Process

The process of summarizing a single text or html document in Clairlib involves the following steps:

  • Read in the file and create a Clair::Document object.
  • Split the document into sentences.
  • Calculate the sentence features.
  • Combine the feature values to score the sentences.
  • Return the sentences with the highest scores.

The details of how to carry out these steps using Clairlib follow.

Read in the file and create a Clair::Document object

The first step is to create a new Clair::Document object. The file name is passed to the constructor as a parameter.

 use Clair::Document;
 # Name of the HTML file to summarize
 my $file = "gulf.html";
 my $doc = Clair::Document->new(type => "html", file => $file);
 # Remove the HTML markup, keeping only the text
 $doc->strip_html();

The last line in the code above removes all html tags from the document and saves the resulting string as the text of the Document object.

Split the document into sentences

As was mentioned in the introduction of this tutorial, summarization starts by splitting the document into pieces (sentences, in our case). This can be easily done by calling the split_into_sentences subroutine of the Document object.

$doc->split_into_sentences();

The split_into_sentences subroutine splits the document into an array of sentences and stores it internally in the Document object.

Calculate the sentence features

In this tutorial, we will calculate four features for each sentence: its length, its position, its centroid, and its similarity to the first sentence in the document. The compute_sentence_features subroutine of the Document object takes a hash of features as input and computes each of them for every sentence in the document. Each key in the hash is the name of a feature, and each value is a reference to a subroutine that calculates that feature. The Clair::SentenceFeatures module provides several subroutines for computing sentence features. We will make use of these subroutines to calculate our desired features.

use Clair::SentenceFeatures qw(length_feature
                               position_feature 
                               sim_with_first_feature 
                               centroid_feature);
# define the features hash
my %features = (
   'length' => \&length_feature,
   'position' => \&position_feature,
   'simwithfirst' => \&sim_with_first_feature,
   'centroid' => \&centroid_feature
);
$doc->compute_sentence_features(%features);
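
The name-to-subroutine pattern used by the features hash can be sketched in plain Perl, without Clairlib. Everything below is an illustrative assumption: word_count_feature, inverse_position_feature, compute_features, their argument lists, and the sample sentences are hypothetical, and they are not the Clair::SentenceFeatures implementations or the internals of compute_sentence_features.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for feature subroutines (NOT Clairlib's own).
sub word_count_feature {
    my ($sentence, $index) = @_;
    my @words = split ' ', $sentence;
    return scalar @words;
}

sub inverse_position_feature {
    my ($sentence, $index) = @_;
    return 1 / ($index + 1);    # earlier sentences get larger values
}

# Apply every feature subroutine to every sentence.
# Returns one hash ref per sentence: { feature name => value }.
sub compute_features {
    my ($sentences, $features) = @_;
    my @values;
    for my $i (0 .. $#{$sentences}) {
        my %row;
        for my $name (keys %{$features}) {
            $row{$name} = $features->{$name}->($sentences->[$i], $i);
        }
        push @values, \%row;
    }
    return @values;
}

# Made-up sentences, used only to exercise the sketch.
my @sentences = ("The plane crashed.", "Rescue teams searched the area overnight.");
my %sketch_features = (
    length   => \&word_count_feature,
    position => \&inverse_position_feature,
);
my @values = compute_features(\@sentences, \%sketch_features);
for my $i (0 .. $#values) {
    printf "%d: length=%d position=%.2f\n", $i, $values[$i]{length}, $values[$i]{position};
}
```

Because the hash maps names to code references, adding a feature in this sketch only requires adding one more key/value pair.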

Since the computed values belong to different features, and are therefore on different scales, they need to be normalized (i.e., rescaled so that the values of every feature lie between 0 and 1). The normalize_sentence_features subroutine does this. It takes as input an array of the names of the features to be normalized.

# Normalize every feature that was computed above
my @feature_names = keys %features;
$doc->normalize_sentence_features(@feature_names);
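
To make the idea of rescaling to the range 0 to 1 concrete, here is a minimal plain-Perl sketch of min-max scaling. This tutorial does not state the exact formula Clairlib uses, so min-max scaling is an assumption, and min_max_normalize and the sample values are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Rescale one feature's values to the range [0, 1] (min-max scaling).
# If all values are equal there is no spread, so every value maps to 0
# (an arbitrary choice for this sketch).
sub min_max_normalize {
    my @values = @_;
    return () unless @values;
    my ($lo, $hi) = (min(@values), max(@values));
    return (0) x scalar(@values) if $hi == $lo;
    return map { ($_ - $lo) / ($hi - $lo) } @values;
}

my @lengths    = (4, 10, 7);          # made-up raw feature values
my @normalized = min_max_normalize(@lengths);
print join(", ", @normalized), "\n";  # prints "0, 1, 0.5"
```

After this step the smallest value of a feature becomes 0 and the largest becomes 1, so features measured on different scales can be combined fairly.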

Combine the feature values to score the sentences

The next step is to combine the feature values computed in the previous section to score the sentences. The score_sentences subroutine of the Document object scores the sentences using a given combiner subroutine. The combiner subroutine is passed a hash that maps feature names to their values and should return a real number as the score. By default, the sentence scores are normalized unless a normalize argument is passed and set to 0. Alternatively, if a weights argument is given as a reference to a hash of feature weights, the returned score will be a linear combination of the features named in the hash, using the given weights. This option overrides the combiner parameter. In this tutorial we'll use the weights option and weight all the features equally.

# Create the weights hash
my %weights=();
# weight all the features equally.
$weights{"length"}=1;
$weights{"position"}=1;
$weights{"simwithfirst"}=1;
$weights{"centroid"}=1;
$doc->score_sentences( weights => \%weights );
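
The linear combination described above amounts to a weighted sum of feature values. The following plain-Perl sketch shows that arithmetic; weighted_score, its argument order, and the sample values are hypothetical, and the way Clairlib handles missing features or normalizes the final scores is not shown here.

```perl
use strict;
use warnings;

# Weighted sum: score = sum over weighted features of weight * value.
# In this sketch, a weight whose name has no matching feature value
# contributes nothing to the score.
sub weighted_score {
    my ($weights, %features) = @_;
    my $score = 0;
    for my $name (keys %{$weights}) {
        next unless exists $features{$name};
        $score += $weights->{$name} * $features{$name};
    }
    return $score;
}

# Equal weights and made-up normalized feature values for one sentence.
my %equal_weights     = (length => 1, position => 1, simwithfirst => 1, centroid => 1);
my %sentence_features = (length => 0.5, position => 1, simwithfirst => 0.25, centroid => 0.75);
print weighted_score(\%equal_weights, %sentence_features), "\n";   # prints "2.5"
```

With equal weights the score is simply the sum of the normalized features; raising one weight makes that feature count for more in the ranking.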

Return the sentences with the highest scores

Now all the sentences have been scored based on the desired features. The last step is to pick the sentences with the highest scores and return them as the summary of the original document. The get_summary subroutine of the Document object returns the highest-scoring sentences. You can limit the number of returned sentences by passing a size argument, and you can choose whether to preserve the order of the sentences as in the original document or to order them by their scores.

my @summary = $doc->get_summary(size => 5, preserve_order => 1);

The returned @summary is an array of hash references. Each hash reference represents a sentence and contains the following key/value pairs:

  • index - The index of this sentence (starting at 0).
  • text - The text of this sentence.
  • features - A hash reference of this sentence's features.
  • score - The score of this sentence.
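
The selection step can be illustrated with a small plain-Perl sketch that works on hash references carrying the index, text, and score keys listed above. The top_sentences subroutine, its positional arguments, and the sample data are hypothetical; this is one plausible way to pick and order the top sentences, not the get_summary implementation.

```perl
use strict;
use warnings;

# Pick the $size highest-scoring sentences. With $preserve_order true the
# result is in document order; otherwise it is in descending score order
# (ties broken by document position).
sub top_sentences {
    my ($sentences, $size, $preserve_order) = @_;
    my @ranked = sort { $b->{score} <=> $a->{score} || $a->{index} <=> $b->{index} }
                 @{$sentences};
    $size = @ranked if $size > @ranked;
    my @top = @ranked[0 .. $size - 1];
    @top = sort { $a->{index} <=> $b->{index} } @top if $preserve_order;
    return @top;
}

# Made-up scored sentences.
my @scored = (
    { index => 0, text => "First.",  score => 0.9 },
    { index => 1, text => "Second.", score => 0.2 },
    { index => 2, text => "Third.",  score => 1.0 },
);
print join(" ", map { $_->{text} } top_sentences(\@scored, 2, 1)), "\n";  # prints "First. Third."
print join(" ", map { $_->{text} } top_sentences(\@scored, 2, 0)), "\n";  # prints "Third. First."
```

Preserving document order usually reads more naturally as a summary, while score order is handy when inspecting which sentences ranked highest.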

You can use Data::Dumper to see the structure and the values of the returned array:

use Data::Dumper;
print Dumper(@summary);

You can also loop through the array and print the summary sentences.

foreach my $sent (@summary) {
   print "$sent->{text} ";
}