Using MEAD as a summarizer with Clairlib

MEAD is a publicly available toolkit for multi-lingual summarization and evaluation. The toolkit implements multiple summarization algorithms (at arbitrary compression rates) such as position-based, Centroid, TF*IDF, and query-based methods. Methods for evaluating the quality of the summaries include co-selection (precision/recall, kappa, and relative utility) and content-based measures (cosine, word overlap, bigram overlap). You can download the latest version of MEAD from http://www.summarization.com/mead.

The Clair::MEAD::* modules form an interface to MEAD. We'll use Clair::MEAD::Wrapper to access MEAD's summarization functionality.

For this tutorial, our sample documents are a set of news articles (gulf.tar.gz), all about the Gulf Air Airbus that crashed off the coast of Bahrain in the Persian Gulf on August 23, 2000.

Summarization Process

The summarization process consists of the following steps:

  • Create a cluster of all the documents.
  • Create a Clair::MEAD::Wrapper object.
  • Specify the summarization options.
  • Run MEAD and return the summary.

The details of how to implement these steps using Clairlib follow.

Create a cluster of all the documents

First, extract the article files:

tar xzvf gulf.tar.gz

Let's assume that the files are extracted into "./gulf". Prepare a list of all the article files:

my @docs = glob("./gulf/*");
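
A bare glob also returns any subdirectories that happen to sit in "./gulf". The following is a minimal sketch, assuming the articles are plain files directly under that directory, that keeps only regular files and sorts them so the order is reproducible. The helper name list_article_files is hypothetical and not part of Clairlib.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Clairlib): return the regular files
# directly inside $dir, sorted by name. Directories are skipped.
sub list_article_files {
    my ($dir) = @_;
    my @files = grep { -f $_ } glob("$dir/*");
    return sort @files;
}

my @docs = list_article_files("./gulf");
print scalar(@docs), " article files found\n";
```

The resulting @docs array can be passed to load_file_list_array() in the same way as the plain glob result.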

Then, create a new Clair::Cluster object.

use Clair::Cluster;
my $cluster = Clair::Cluster->new();

Add the documents to the cluster:

$cluster->load_file_list_array(\@docs, type => "text", filename_id => 1);

The load_file_list_array() subroutine reads in the files, creates a new Clair::Document object of type "text" for each file, assigns an incrementing number as a unique id for each Document object, and adds the Document objects to the cluster.

Create a Clair::MEAD::Wrapper object

The Clair::MEAD::Wrapper module is a wrapper for MEAD that gives you access to MEAD's functionality. To create a new Clair::MEAD::Wrapper object, use:

use Clair::MEAD::Wrapper;
my $mead = Clair::MEAD::Wrapper->new(
    mead_home => "/path/to/mead",
    cluster => $cluster
);

The 'mead_home' argument tells Clairlib where to find a MEAD instance. Change its value to the path of your MEAD installation. The 'cluster' argument is a Clair::Cluster object containing the documents to summarize. We pass the $cluster object that we created in the previous section as the value of this argument.

Specify the summarization options

To control the way MEAD works, you can specify options and pass them to MEAD using the add_option() subroutine of the Wrapper object. Some of the common options are:

  • -sentences, -s: produce a summary whose length is either an absolute number or a percentage of the number of sentences of the original cluster. (This is the default.)
  • -words, -w: produce a summary whose length is either an absolute number or a percentage of the number of words of the original cluster.

  • -percent num, -p num, -compression_percent num: produce a summary whose length is num% of the length of the original cluster. (The default is -percent 20.)
  • -absolute num, -a num, -compression_absolute num: produce a summary whose length is num (words/sentences) regardless of the size of the original cluster. NOTE: if both -percent and -absolute are specified, MEAD’s behavior may be erratic.
  • -system RANDOM, -RANDOM: produce a random summary (and name the system “RANDOM”).
  • -system LEADBASED, -LEADBASED: produce a lead-based summary, selecting the first sentence from each document, then the second sentence, etc. (and name the system “LEADBASED”). NOTE: RANDOM and LEADBASED systems override any classifier, reranker, and features that may be specified.
  • -lang language: The default is “ENG”. This option currently has little effect. Since mead.pl doesn’t currently do Chinese summarization, you’ll probably never have to specify “CHIN”. To do summarization in Chinese, refer to the appropriate section (you’ll have to use the old-fashioned meadconfig file method).
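
To make the LEADBASED ordering in the list above concrete, here is a small stand-alone sketch in plain Perl. It is an illustration of the description only, not MEAD's own code: the helper lead_based_order and the sample data are hypothetical, and MEAD may handle details such as length limits differently.

```perl
use strict;
use warnings;

# Illustration only (not MEAD's implementation): take sentence 1 of every
# document, then sentence 2 of every document, and so on, stopping once
# $limit sentences have been picked.
sub lead_based_order {
    my ($docs, $limit) = @_;    # $docs: array ref of array refs of sentences
    return () if $limit <= 0;

    # Find the length of the longest document.
    my $longest = 0;
    for my $doc (@$docs) {
        $longest = @$doc if @$doc > $longest;
    }

    my @picked;
    for my $i (0 .. $longest - 1) {
        for my $doc (@$docs) {
            next if $i >= @$doc;    # this document has no sentence $i
            push @picked, $doc->[$i];
            return @picked if @picked >= $limit;
        }
    }
    return @picked;
}

# Hypothetical sample: three documents with 3, 1, and 2 sentences.
my @sample = (
    [ "A1", "A2", "A3" ],
    [ "B1" ],
    [ "C1", "C2" ],
);
print join(" ", lead_based_order(\@sample, 4)), "\n";    # prints "A1 B1 C1 A2"
```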

For example, to produce a summary whose length is 5% of the number of sentences of the original cluster, we use:

$mead->add_option("-s -p 5");

Run MEAD and return the summary

The final step is to run MEAD to summarize the given cluster based on the specified options:

my @summary = $mead->run_mead();

The run_mead() subroutine returns the summary as an array of sentences. You can print the summary by looping through the array and printing the text of each sentence:

foreach my $sent (@summary) {
   print "$sent->{text} ";
}
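
Putting the four steps together, a complete script might look like the sketch below. It assumes that Clairlib and MEAD are installed, that the archive has been extracted to "./gulf", and that the placeholder "/path/to/mead" is replaced with your own MEAD directory. The helper sentence_text is hypothetical: it reads the text field as in the loop above and passes plain strings through unchanged, so the sketch works with either form of result.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the text of one summary element. Hash
# references are read through their 'text' key, as in the tutorial's loop;
# plain strings are returned unchanged (an assumption of this sketch).
sub sentence_text {
    my ($sent) = @_;
    return ref($sent) eq 'HASH' ? $sent->{text} : $sent;
}

# Load Clairlib at run time so a missing installation gives a clear message.
if (eval { require Clair::Cluster; require Clair::MEAD::Wrapper; 1 }) {
    # Step 1: build a cluster from the extracted articles.
    my @docs    = glob("./gulf/*");
    my $cluster = Clair::Cluster->new();
    $cluster->load_file_list_array(\@docs, type => "text", filename_id => 1);

    # Step 2: wrap MEAD (replace the placeholder path below).
    my $mead = Clair::MEAD::Wrapper->new(
        mead_home => "/path/to/mead",
        cluster   => $cluster,
    );

    # Step 3: ask for 5% of the cluster's sentences.
    $mead->add_option("-s -p 5");

    # Step 4: run MEAD and print the summary.
    my @summary = $mead->run_mead();
    foreach my $sent (@summary) {
        my $text = sentence_text($sent);
        print "$text ";
    }
    print "\n";
}
else {
    warn "Clairlib is not available: $@";
}
```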