Multi-Document Summarization

From Clairlib

In this tutorial you will learn how to use Clairlib to summarize multiple documents into a single condensed paragraph. As in the previous tutorial (Single Document Summarization), we will use the Sentence Extraction method, but instead of considering the sentences of one document, we will consider the sentences of a whole set of documents.

For the purpose of this tutorial, we will use a set of news articles, all about the Gulf Air Airbus that crashed off the coast of Bahrain in the Persian Gulf on August 23, 2000, as our sample set of documents (gulf.tar.gz).


Multi-Document Summarization Process

The process of summarizing multiple documents in Clairlib involves the following steps:

  • Create a cluster of all the documents.
  • Compute the values of the desired features for all the sentences of all the documents in the cluster.
  • Combine the feature values to score the sentences.
  • Return the highest-scoring sentences as the summary.

The details of how to implement these steps using Clairlib follow.

Create a cluster of all documents

First, extract the article files:

tar xzvf gulf.tar.gz

Let's assume the files are extracted into "./gulf". Prepare a list of all the article files.

my @docs = glob("./gulf/*");

Then, create a new Clair::Cluster object.

 use Clair::Cluster;
 my $cluster = Clair::Cluster->new();

Add the documents to the cluster

$cluster->load_file_list_array(\@docs, type => "text", filename_id => 1);

The load_file_list_array() subroutine reads in the files, creates a new Clair::Document object of type "text" for each file, assigns an incrementing number as a unique id for each Document object, and adds the Document objects to the cluster.
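Conceptually, load_file_list_array() behaves like the following plain-Perl sketch (an illustration only, not Clairlib's actual implementation; the helper name load_texts is made up for this example):

```perl
use strict;
use warnings;

# Read each file into a string and key it by an incrementing id,
# mimicking how the cluster assigns a unique id to each document.
sub load_texts {
    my ($files) = @_;
    my %docs;
    my $id = 1;
    for my $file (@$files) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        local $/;                      # slurp mode: read the whole file
        $docs{$id++} = <$fh>;
        close $fh;
    }
    return %docs;
}
```

In Clairlib the values would be Clair::Document objects of type "text" rather than raw strings, but the id-to-document mapping works the same way.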

Compute the values of the desired features for all the sentences of all the documents in the cluster

We will compute three features for each sentence: the length, the position, and the similarity to the first sentence in the document. The compute_sentence_features subroutine of the Cluster object takes a hash of features as input and computes the given features for each of the sentences of all the documents in the cluster. The hash of features should have the name of the feature as the key and a reference to a subroutine that calculates the feature as the value. The Clair::SentenceFeatures module provides several subroutines to compute the features of a sentence. We will make use of these subroutines to calculate our desired features.

use Clair::SentenceFeatures qw(length_feature
                               position_feature
                               sim_with_first_feature);
# define the features hash
my %features = (
   'length' => \&length_feature,
   'position' => \&position_feature,
   'simwithfirst' => \&sim_with_first_feature
);
# compute the features for every sentence in the cluster
$cluster->compute_sentence_features(%features);
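To build intuition for what these three features measure, here is a plain-Perl sketch of simplified stand-ins (these are illustrations, not Clairlib's actual implementations): length as a word count, position as closeness to the start of the document, and similarity to the first sentence as word overlap.

```perl
use strict;
use warnings;

# Length feature: the number of words in the sentence.
sub length_score {
    my ($sent) = @_;
    my @words = split /\s+/, $sent;
    return scalar @words;
}

# Position feature: earlier sentences score higher.
# $index is 0-based; $total is the number of sentences.
sub position_score {
    my ($index, $total) = @_;
    return ($total - $index) / $total;
}

# Similarity-with-first feature: the fraction of the sentence's
# words that also appear in the document's first sentence.
sub sim_with_first_score {
    my ($sent, $first) = @_;
    my %first_words = map { lc($_) => 1 } split /\s+/, $first;
    my @words = split /\s+/, $sent;
    return 0 unless @words;
    my $overlap = grep { $first_words{lc $_} } @words;
    return $overlap / @words;
}
```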

Since the computed values are for different features (and thus are on different scales), those values need to be normalized (i.e., all the values of all the features are rescaled to lie between 0 and 1). The normalize_sentence_features subroutine does this. It takes as input an array of the names of the features to be normalized.

my @feature_names = keys %features;
$cluster->normalize_sentence_features(@feature_names);
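Min-max normalization of this kind can be sketched in plain Perl (an illustration of the idea, not Clairlib's code):

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Rescale a list of raw feature values into [0, 1] so that
# differently scaled features become directly comparable.
sub normalize {
    my (@values) = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    return map { 0 } @values if $hi == $lo;    # avoid division by zero
    return map { ($_ - $lo) / ($hi - $lo) } @values;
}
```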

Combine the features values to score the sentences

The next step is to combine the values of the features computed in the previous section to score the sentences. The score_sentences subroutine of the Cluster object scores the sentences using a given combiner subroutine. The combiner subroutine is passed a hash containing feature names mapped to their values and should return a real number as the score. By default, the sentence scores will be normalized unless a normalize argument is passed and set to 0. Alternatively, if a weights argument is passed with a hash mapping feature names to weights, the returned score will be a linear combination of the features specified in the hash according to their given weights. This option overrides the combiner parameter. In this tutorial we'll use the weights option and weight all the features equally.

# Create the weights hash, weighting all the features equally
my %weights = (
   'length' => 1,
   'position' => 1,
   'simwithfirst' => 1
);
$cluster->score_sentences( weights => \%weights );
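The weighted linear combination that score_sentences performs with the weights option can be illustrated in plain Perl (a sketch of the idea, not Clairlib's internals):

```perl
use strict;
use warnings;

# Score a sentence as the weighted sum of its (normalized)
# feature values: score = sum over f of weight(f) * value(f).
sub combine_features {
    my ($features, $weights) = @_;
    my $score = 0;
    for my $name (keys %$features) {
        $score += ($weights->{$name} // 0) * $features->{$name};
    }
    return $score;
}
```

With equal weights, every feature contributes proportionally to its normalized value; changing a weight shifts how much that feature influences the ranking.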

Return the sentences with highest scores

Now we have all the sentences scored based on the desired features. The last step is to pick the sentences with the highest scores and return them as the summary of the original set of documents. The get_summary subroutine of the Cluster object returns the highest-scoring sentences. You can limit the maximum number of returned sentences by passing a size argument, and you can also choose whether to preserve the sentence order as in the original document set or to order the sentences by their scores.

my @summary = $cluster->get_summary(size => 5, preserve_order => 1);

The returned @summary is an array of hash references. Each hash reference represents a sentence and contains the following key/value pairs:

  • index - The index of this sentence (starting at 0).
  • text - The text of this sentence.
  • features - A hash reference of this sentence's features.
  • score - The score of this sentence.
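Selecting the top-scoring sentences while optionally preserving their original order can be sketched in plain Perl over an array of such hash references (an illustration of what get_summary does conceptually, not Clairlib's code):

```perl
use strict;
use warnings;

# Pick the $size highest-scoring sentences; if $preserve_order is
# true, return them sorted by their original index instead of score.
sub top_sentences {
    my ($sentences, $size, $preserve_order) = @_;
    my @top = (sort { $b->{score} <=> $a->{score} } @$sentences)[0 .. $size - 1];
    @top = grep { defined } @top;    # in case $size exceeds the sentence count
    @top = sort { $a->{index} <=> $b->{index} } @top if $preserve_order;
    return @top;
}
```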

You can use Data::Dumper to see the structure and the values of the returned array.

use Data::Dumper;
print Dumper(@summary);

You can also loop through the array and print the summary sentences.

foreach my $sent (@summary) {
   print "$sent->{text} ";
}