Simple Information Retrieval System


In this tutorial, you will learn how to build a basic information retrieval system. We will use a well-known IR test corpus, the Cranfield Collection, which contains 1,400 documents on aerodynamics. The main task in this tutorial is to process the Cranfield documents and build an inverted index from them. After that, we will build a system that takes a query, processes it, finds and ranks the documents matching that query, and returns a ranked list of results.

Corpus Description

The corpus used in this tutorial is the Cranfield Collection. Cranfield documents follow this generic XML representation:

<DOC>
<DOCNO>
...
</DOCNO>
<TITLE>
...
</TITLE>
<AUTHOR>
...
</AUTHOR>
<BIBLIO>
...
</BIBLIO>
<TEXT>
...
</TEXT>
</DOC>

Download the corpus

You can download the Cranfield corpus from http://www.clairlib.org/clair/clairlib/cranfield.tar.gz. After downloading the file, extract it to a directory:

wget http://www.clairlib.org/clair/clairlib/cranfield.tar.gz
mkdir cranfield
cd cranfield
tar -xvzf ../cranfield.tar.gz

Tutorial Source Code

All the code files used in this tutorial are also available at http://www.clairlib.org/clair/clairlib/irtutorial.tar.gz. Extract this file to a separate directory.

Parse corpus files and store metadata

Parse XML file

The first thing we need to do is write a subroutine that parses a Cranfield XML file and stores its content in a hash.

use XML::Simple;

sub parse_file{
   my $file = shift;
   # create the parser object
   my $xml = XML::Simple->new();
   # read and parse the XML file into a hash reference
   my $data = $xml->XMLin($file);
   return $data;
}
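
For example, the returned hash reference exposes each field by its tag name (the file name below is hypothetical; use any file from the extracted corpus):

my $data = parse_file("cranfield/cranfield0001");   # hypothetical file name
print $data->{"DOCNO"}, "\n";   # the document number
print $data->{"TITLE"}, "\n";   # the document title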

Next, we write another subroutine to process all the collection files, store them in a format suitable for indexing, and store their metadata. We need this step because the current format of the files isn't compatible with the input format that the Clairlib indexing tools expect. We will explain the parts of this subroutine first and then show the whole code.

Create DBM files to store metadata

First, we need to create four DBM files to store the metadata: title, bibliography, author, and length.

   %title_meta=();
   dbmopen(%title_meta, "titles", 0666);
   %bibl_meta=();
   dbmopen(%bibl_meta, "bibl", 0666);
   %author_meta=();
   dbmopen(%author_meta, "authors", 0666);
   %length=();
   dbmopen(%length, "doclength", 0666);

This will create four DBM files and bind each of them to a hash.

Parse all files and store metadata

Then, we read all the files from their location ($cranfield_path), parse each file, store its content in a new txt file in the location specified by $destination, and store its metadata in the DBM files through the hashes.

   if(-d $destination){
      `rm -r $destination`;
   }
   `mkdir $destination`;
   @files = <$cranfield_path/*>;
   foreach my $doc(@files){
       my $hash = parse_file("$doc");
       my $text =  $hash->{"TITLE"} . "\n" . $hash->{"TEXT"}. "\n" . $hash->{"AUTHOR"};
       my $docno = $hash->{"DOCNO"};
       $docno =~ s/[^0-9]*([0-9]+)[^0-9]*/$1/g;
       open (FILE, ">$destination/$docno.txt") or die "Can't create file";
       print FILE $text;
       close (FILE);
       $title_meta{$docno} = $hash->{"TITLE"};
       $author_meta{$docno} = $hash->{"AUTHOR"};
       $text =~ s/\.//g; $text =~ s/,//g;
       my @doclength = split /\s+/, $text;
       $length{$docno} = scalar @doclength;
   }

The first four lines of the code above check whether the directory $destination already exists. If it does, it is deleted, and then a new empty directory is created. The rest of the code loops through the collection files and processes them as explained above.
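
If you prefer not to shell out to rm and mkdir, the same effect can be achieved portably with the core File::Path module (an equivalent sketch, not what the tutorial code uses):

use File::Path qw(remove_tree make_path);
# delete any previous output directory, then create a fresh empty one
remove_tree($destination) if -d $destination;
make_path($destination);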

Putting it all together

The whole code of the process_collection() subroutine:

sub process_collection{
   my ($cranfield_path, $destination) = @_;
   print "parsing xml and storing metadata...\n";
   %title_meta=();
   dbmopen(%title_meta, "titles", 0666);
   %bibl_meta=();
   dbmopen(%bibl_meta, "bibl", 0666);
   %author_meta=();
   dbmopen(%author_meta, "authors", 0666);
   %length=();
   dbmopen(%length, "doclength", 0666);
   if(-d $destination){
      `rm -r $destination`;
   }
   `mkdir $destination`;
   @files = <$cranfield_path/*>;
   foreach my $doc(@files){
       my $hash = parse_file("$doc");
       my $text =  $hash->{"TITLE"} . "\n" . $hash->{"TEXT"}. "\n" . $hash->{"AUTHOR"};
       my $docno = $hash->{"DOCNO"};
       $docno =~ s/[^0-9]*([0-9]+)[^0-9]*/$1/g;
       open (FILE, ">$destination/$docno.txt") or die "Can't create file";
       print FILE $text;
       close (FILE);
       $title_meta{$docno} = $hash->{"TITLE"};
       $author_meta{$docno} = $hash->{"AUTHOR"};
       $text =~ s/\.//g; $text =~ s/,//g;
       my @doclength = split /\s+/, $text;
       $length{$docno} = scalar @doclength;
   }
}

Compute TF and IDF

By now, we have the files ready for indexing and stored in $destination. The next step is to use the Clair::Utils::CorpusDownload module to read these files and create a Clairlib corpus.

We start by creating a new object of Clair::Utils::CorpusDownload.

$corpus = Clair::Utils::CorpusDownload->new(corpusname => "corpus", rootdir => "produced");

rootdir is the path to the directory where the corpus and the associated TF/IDF data will be built and stored.

Using the $corpus object, we call the build_corpus_from_directory subroutine, which builds a corpus from a set of files located on disk.

$corpus->build_corpus_from_directory(dir=>$data_source, cleanup => 0, skipCopy => 0);

This will read all the txt files that we created above and store them in the "produced" directory in TREC (Text REtrieval Conference) format. The cleanup => 0 option retains the metafiles produced during the corpus build.

Then, we build the IDF (inverse document frequency) and the TF (term frequency) databases.

$corpus->buildIdf(stemmed => 1);
$corpus->build_docno_dbm();
$corpus->buildTf(stemmed => 1);

The stemmed => 1 option indicates that the TF and the IDF are computed over stemmed tokens. Notice that we have to call build_docno_dbm before buildTf; build_docno_dbm builds the DOCNO-to-URL and URL-to-DOCNO databases.
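
Once built, these databases can be read back with the Clair::Utils::Tf module, which is exactly what the query code later in this tutorial does. As a quick sanity check (the query word "flow" is an arbitrary example):

use Clair::Utils::Tf;
my $tf = Clair::Utils::Tf->new(rootdir => "produced",
                               corpusname => "corpus",
                               stemmed => 1);
# getDocsWithPhrase returns a hash reference of matching documents
my $urls = $tf->getDocsWithPhrase("flow");
print scalar(keys %$urls), " documents match\n";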

The whole code for the create_index subroutine, which creates the corpus and builds the index (including the term counts built by build_term_counts), is:

sub create_index{
   my $data_source = shift;
   my $corpus = Clair::Utils::CorpusDownload->new(corpusname => "corpus", rootdir => "produced");
   $corpus->build_corpus_from_directory(dir=>"$data_source", cleanup => 0, relative => 1, skipCopy => 0);
   $corpus->buildIdf(stemmed => 1);
   $corpus->build_docno_dbm();
   $corpus->buildTf(stemmed => 1);
   $corpus->build_term_counts(stemmed => 1);
}

Now, put the subroutines defined so far (parse_file(), process_collection(), and create_index()) in a module file and name it IR.pm; the remaining subroutines in this tutorial will be added to the same file.
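
The tutorial does not show the module boilerplate for IR.pm. One minimal way to set it up (a sketch; any equivalent Exporter setup works) is:

package IR;
use Exporter 'import';
# export every subroutine used by index.pl and query.pl, including the
# ones added later in this tutorial
our @EXPORT = qw(parse_file process_collection create_index
                 execute_query get_summary getAllDocKeys negationResults);

# ... paste the subroutines from this tutorial here ...

1;   # a Perl module must end with a true value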

Create a Perl script file index.pl and add the following to it:

#! /usr/bin/perl
use IR;
my $collection_path=shift;
my $destination="data";
process_collection($collection_path,$destination);
create_index($destination);

Queries description

We need our system to be able to handle the following types of queries:

  • cat : returns any document that has the word "cat" in it
  • cat dog : any document that has one or more of these words (a "fuzzy or" is assumed by default)
  • cat dog rat : up to 10 words in a query
  • "tabby cat" : phrases of up to 5 words in length
  • "small tabby cat" "shaggy dog" : multiple phrases in a query
  • !cat, !"tabby cat" : negations of single words or phrases
  • !cat !dog : multiple negations per query

In this tutorial, we will not worry about nested queries in parentheses. Each query should return one or more matching documents. If multiple documents match, we should order them by decreasing score. The score is defined as the number of query terms (or phrases) that match in the "fuzzy or", including repetitions and counting phrases as the number of words that are included in them. Negated terms are not included in the score. For example, if the query is:

cat dog "pack rat"

and we have three documents D1, D2, and D3 that contain at least one of the query terms as follows:

D1: cat cat dog cat mouse
D2: pack rat cat rat rat
D3: rabbit elephant dog dog cat pack rat cat

their scores should be as follows:

D1: 4
D2: 3  ("pack rat" counts as two terms, but "rat" alone doesn't count)
D3: 6
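
To make the scoring rule concrete, here is a tiny illustrative scorer (a hypothetical helper for this example only, not part of the tutorial's code) that applies the rule directly to raw text:

sub toy_score{
   my ($text, @terms) = @_;
   my $score = 0;
   foreach my $term (@terms){
      # a phrase of N words contributes N points per occurrence
      my @words = split /\s+/, $term;
      my $n = scalar @words;
      # count every occurrence of the term, including repetitions
      my $count = () = $text =~ /\b\Q$term\E\b/g;
      $score += $count * $n;
   }
   return $score;
}

For example, toy_score("cat cat dog cat mouse", "cat", "dog", "pack rat") returns 4, matching D1 above.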

Processing and handling the queries

We start by showing the code; the explanation follows.

#!/usr/bin/perl 
use IR;
print "Enter your query or type q to quit\n>";
my $query= <>;
while ($query ne "q\n"){ 
  my @searchTerms =();  
  while ($query =~ s/!\"(.+?)\"//){
    push(@searchTerms, "!".$1);
  }
  while ($query =~ s/\"(.+?)\"//){
    push(@searchTerms, $1);
  }
  push(@searchTerms, split(/\s+/, $query));
  my ($ref, $locref) = execute_query(@searchTerms);
  my %results = %$ref;
  my %locations = %$locref;
  @sortedResults = reverse sort {$results{$a} <=> $results{$b}} keys %results;
  foreach my $result(@sortedResults){
     my $sum =  get_summary($result, $locations{$result});
     print "Doc: $result  \tScore: $results{$result}\t$sum\n"; 
  }
  print "\n>";
  $query=<>;
}

Add the code above to a Perl script file and name it query.pl

The code for the get_summary() subroutine is

sub get_summary{
   my ($docId, $position) = @_;
   # read the document text back from the txt file created earlier
   my $text = `cat data/$docId.txt`;
   $text =~ s/\s\./\./g;
   my $return = "";
   my $start = 0;
   my @words = split /\s+/, $text;
   # start the snippet a few words before the first match, when possible
   if ($position > 11){
       $start = $position - 11;
   }
   # collect up to 21 words around the match, broken into two lines
   for(my $count = $start; $count <= ($start+20); $count++){
       if (exists $words[$count]){$return .= "$words[$count] ";}
       if ($count == $start +10){$return .= "\n\t\t";}
   }
   return $return;
}

Add get_summary() to IR.pm

Read in the query and parse it

The code above starts by asking the user to enter a query to search for, or "q" to exit.

print "Enter your query or type q to quit\n>";
my $query= <>;

For each query, create an array to store the query terms in.

my @searchTerms=();

A search term is a single word (e.g. cat), a negated word (e.g. !cat), a phrase (e.g. "cat dog"), or a negated phrase (e.g. !"cat dog"). To parse the query into a set of search terms, we start by extracting the negated phrases from the query (e.g. !"cat dog") and add each one as a single negated entry to the @searchTerms array.

while ($query =~ s/!\"(.+?)\"//){
    push(@searchTerms, "!".$1);
}

For example, if the query is

cat !dog !"dog rat" "monkey cat"

The above while loop extracts !"dog rat" from the query, adds it as the single entry !dog rat to the @searchTerms array, and leaves (cat !dog "monkey cat") in the query.

Next, we extract the unnegated phrases from the query (e.g. "monkey cat"):

while ($query =~ s/\"(.+?)\"//){
    push(@searchTerms, $1);
}

The above while loop extracts the phrase (monkey cat), adds it as a single search term, and leaves (cat !dog).

After that, we add all the remaining words (negated and unnegated) to @searchTerms:

push(@searchTerms, split(/\s+/, $query));

Execute the query and return the results

After adding all the search terms to @searchTerms as explained in the previous subsection, the array is passed to the execute_query subroutine:

my ($ref, $locref) = execute_query(@searchTerms);

execute_query takes an array of search terms and returns references to two hashes: one mapping the DocIds of matching documents to their scores, and the other mapping each DocId to the position of its first match (used by get_summary). The code of execute_query is:

sub execute_query{
   my @searchTerms = @_;
   my $tf = Clair::Utils::Tf->new(rootdir => "produced", corpusname => "corpus", stemmed => 1);
   my %out = ();
   my %location = ();
   my @negation;
   my $negationOnly = 1;
   foreach my $term (@searchTerms){
       if ($term =~ s/!//g){
           # negated term: set it aside for later removal from the results
           push(@negation, $term);
       }else{
           $negationOnly = 0;
           my @words = split(/ /, $term);
           my $numWords = @words;
           my $urls = $tf->getDocsWithPhrase(@words);
           foreach my $key (keys %$urls){
               my $ref = $urls->{$key};
               # each occurrence of a phrase of N words contributes N points
               my $toAdd = keys(%$ref) * $numWords;
               $key =~ s/.*\/([0-9]+)\.txt/$1/g;
               if (!exists $location{$key}){
                   my ($position, $storedVal) = each %$ref;  # just need the first position
                   $location{$key} = $position;
               }
               if (exists $out{$key}){
                   $out{$key} += $toAdd;
               }else{
                   $out{$key} = $toAdd;
               }
           }
       }
   }
   # a purely negative query starts from the full document set
   if ($negationOnly == 1){
       %out = getAllDocKeys($tf);
       foreach my $key (keys %out){
           $location{$key} = 0;
       }
   }
   # remove every document that matches a negated term or phrase
   if ((scalar @negation) > 0){
       my @negatedDocs = negationResults(\@negation, $tf);
       foreach my $removeDoc (@negatedDocs){
           if (exists $out{$removeDoc}){
               delete $out{$removeDoc};
           }
       }
   }
   return \%out, \%location;
}
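
execute_query relies on two helper subroutines, getAllDocKeys and negationResults, which are listed as part of IR.pm in the next section but whose bodies are not shown in this tutorial. Here is a minimal sketch of what they might look like (an illustration built on the same getDocsWithPhrase call used above, assuming the txt files live in data/ as created by index.pl; the real implementations may differ):

# Sketch: return a hash of every document id in the collection (score 0),
# by listing the txt files that process_collection wrote to data/.
sub getAllDocKeys{
   my $tf = shift;   # unused in this sketch; kept for the call signature
   my %out = ();
   foreach my $file (<data/*.txt>){
       (my $docno = $file) =~ s/.*\/([0-9]+)\.txt/$1/;
       $out{$docno} = 0;
   }
   return %out;
}

# Sketch: return the ids of all documents matching any negated term or
# phrase, using the same lookup as the positive terms in execute_query.
sub negationResults{
   my ($negref, $tf) = @_;
   my %docs = ();
   foreach my $term (@$negref){
       my @words = split(/ /, $term);
       my $urls = $tf->getDocsWithPhrase(@words);
       foreach my $key (keys %$urls){
           $key =~ s/.*\/([0-9]+)\.txt/$1/g;
           $docs{$key} = 1;
       }
   }
   return keys %docs;
}

Add both to IR.pm if you are writing them yourself.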

Sort the results on scores

my %results = %$ref;
my %locations = %$locref;
@sortedResults = reverse sort {$results{$a} <=> $results{$b}} keys %results;

Print out the results

foreach my $result(@sortedResults){
   my $sum =  get_summary($result, $locations{$result});
   print "Doc: $result  \tScore: $results{$result}\t$sum\n"; 
}

Test the system

Now, you have three files:

  • IR.pm which includes the following subroutines
    • parse_file
    • create_index
    • process_collection
    • execute_query
    • get_summary
    • getAllDocKeys
    • negationResults
  • index.pl
  • query.pl

Make sure that these files are in the same directory. To test the system, first run index.pl to create the index.

./index.pl cranfield

Then, run query.pl

./query.pl

Then, enter a search query (e.g. chemical reactions):

Enter your query or type q to quit
>chemical reactions