Simple Information Retrieval System
From CLAIRlib
In this tutorial, you will learn how to build a basic information retrieval system. We will use a well-known IR test corpus, the Cranfield Collection, which contains 1,400 aerodynamics documents. The main task in this tutorial is to process the Cranfield documents and build an inverted index from them. After that, we will build a system that takes a query, processes it, finds and ranks the documents matching that query, and returns a ranked list of results.
Corpus Description
The corpus used in this tutorial is the Cranfield Collection. Each Cranfield document follows this generic XML representation:
<DOC>
<DOCNO> ... </DOCNO>
<TITLE> ... </TITLE>
<AUTHOR> ... </AUTHOR>
<BIBLIO> ... </BIBLIO>
<TEXT> ... </TEXT>
</DOC>
Download the corpus
Download the Cranfield corpus archive from the URL below, then extract it into its own directory:
wget http://www.clairlib.org/clair/clairlib/cranfield.tar.gz
mkdir cranfield
cd cranfield
tar -xvzf ../cranfield.tar.gz
Tutorial Source Code
All the code files used in this tutorial are also available at http://www.clairlib.org/clair/clairlib/irtutorial.tar.gz. Extract this archive to a separate directory.
Parse corpus files and store metadata
Parse XML file
The first thing we need to do is write a subroutine that parses a Cranfield XML file and stores its content in a hash.
use XML::Simple;
sub parse_file{
my $file=shift;
# create object
my $xml = XML::Simple->new;
# read XML file
my $data = $xml->XMLin($file);
return $data;
}
Next, we write another subroutine to process all the collection files, store them in a suitable format for indexing, and store their metadata. We need this step because the current format of the files isn't compatible with the input format that clairlib indexing tools expect. We will explain the parts of this subroutine first then show the whole code later.
Create dbm files to store metadata
First, we create four DBM files to store the metadata: title, bibliography, author, and length.
%title_meta=();
dbmopen(%title_meta, "titles", 0666);
%bibl_meta=();
dbmopen(%bibl_meta, "bibl", 0666);
%author_meta=();
dbmopen(%author_meta, "authors", 0666);
%length=();
dbmopen(%length, "doclength", 0666);
This creates four DBM files and binds each of them to a hash, so that anything stored in the hash is written to the corresponding file.
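The binding described above can be tried in isolation. The following is a minimal sketch, assuming a temporary directory and an invented sample title; the helper names store_title and fetch_title are hypothetical and are not part of the tutorial code. It stores one entry through a DBM-bound hash, closes the file, and reads the entry back through a fresh binding.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a throwaway directory so the sketch leaves no files behind.
my $dir = tempdir(CLEANUP => 1);

# Bind a hash to a DBM file, store one entry, and close the file.
sub store_title {
    my ($path, $docno, $title) = @_;
    my %title_meta;
    dbmopen(%title_meta, $path, 0666) or die "Cannot open $path: $!";
    $title_meta{$docno} = $title;
    dbmclose(%title_meta);
}

# Reopen the same DBM file later and read the entry back.
sub fetch_title {
    my ($path, $docno) = @_;
    my %title_meta;
    dbmopen(%title_meta, $path, 0666) or die "Cannot open $path: $!";
    my $title = $title_meta{$docno};
    dbmclose(%title_meta);
    return $title;
}

# "example title" is invented sample data, not a Cranfield title.
store_title("$dir/titles", 1, "example title");
print fetch_title("$dir/titles", 1), "\n";
```

The third argument to dbmopen is the permission mode used if the file has to be created. Which DBM implementation backs the hash, and therefore the file extensions on disk, depends on how Perl was built.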
Parse all files and store metadata
Then, we read all the files from their location ($cranfield_path), parse each file, store its content in a new txt file in the location specified by $destination, and store its metadata in the DBM files through the hashes.
if(-d $destination){
`rm -r $destination`;
}
`mkdir $destination`;
@files = <$cranfield_path/*>;
foreach my $doc(@files){
my $hash = parse_file("$doc");
my $text = $hash->{"TITLE"} . "\n" . $hash->{"TEXT"}. "\n" . $hash->{"AUTHOR"};
my $docno = $hash->{"DOCNO"};
$docno =~ s/[^0-9]*([0-9]+)[^0-9]*/$1/g;
open (FILE, ">$destination/$docno.txt") or die "Can't create file";
print FILE $text;
close (FILE);
$title_meta{$docno} = $hash->{"TITLE"};
$bibl_meta{$docno} = $hash->{"BIBLIO"};
$author_meta{$docno} = $hash->{"AUTHOR"};
$text =~ s/\.//g; $text =~ s/,//g;
my @doclength = split /\s+/, $text;
$length{$docno} = scalar @doclength;
}
The first four lines of the code above check whether the directory $destination already exists. If it does, it is deleted, and a new empty directory is then created. The rest of the code loops through the collection files and processes them as explained before.
Putting it all together
The complete process_collection() subroutine:
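Two small steps inside the loop are easy to check on their own: reducing the DOCNO value to its digits, and counting the words of a document. The sketch below wraps them in two hypothetical helpers (normalize_docno and doc_length are illustrative names, and the sample strings are invented). One deliberate difference from the loop above: it splits with ' ' rather than /\s+/, so leading whitespace does not produce an empty first token.

```perl
use strict;
use warnings;

# Strip everything except the digits from a raw DOCNO value.
sub normalize_docno {
    my ($docno) = @_;
    $docno =~ s/[^0-9]*([0-9]+)[^0-9]*/$1/g;
    return $docno;
}

# Count whitespace-separated tokens after removing periods and commas.
sub doc_length {
    my ($text) = @_;
    $text =~ s/[.,]//g;
    my @tokens = split ' ', $text;
    return scalar @tokens;
}

print normalize_docno(" 12 \n"), "\n";
print doc_length("the boundary layer, in simple shear flow."), "\n";
```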
sub process_collection{
my ($cranfield_path, $destination) = @_;
print "parsing xml and storing metadata...\n";
%title_meta=();
dbmopen(%title_meta, "titles", 0666);
%bibl_meta=();
dbmopen(%bibl_meta, "bibl", 0666);
%author_meta=();
dbmopen(%author_meta, "authors", 0666);
%length=();
dbmopen(%length, "doclength", 0666);
if(-d $destination){
`rm -r $destination`;
}
`mkdir $destination`;
@files = <$cranfield_path/*>;
foreach my $doc(@files){
my $hash = parse_file("$doc");
my $text = $hash->{"TITLE"} . "\n" . $hash->{"TEXT"}. "\n" . $hash->{"AUTHOR"};
my $docno = $hash->{"DOCNO"};
$docno =~ s/[^0-9]*([0-9]+)[^0-9]*/$1/g;
open (FILE, ">$destination/$docno.txt") or die "Can't create file";
print FILE $text;
close (FILE);
$title_meta{$docno} = $hash->{"TITLE"};
$bibl_meta{$docno} = $hash->{"BIBLIO"};
$author_meta{$docno} = $hash->{"AUTHOR"};
$text =~ s/\.//g; $text =~ s/,//g;
my @doclength = split /\s+/, $text;
$length{$docno} = scalar @doclength;
}
}
Compute TF and IDF
By now, we have the files ready for indexing and stored in $destination. The next step is to use the Clair::Utils::CorpusDownload module to read these files and create a Clairlib corpus.
We start by creating a new Clair::Utils::CorpusDownload object.
$corpus = Clair::Utils::CorpusDownload->new(corpusname => "corpus", rootdir => "produced");
rootdir is the path to the directory where the corpus and associated TFIDF will be built and stored.
Using the $corpus object, we call the build_corpus_from_directory method, which builds a corpus from a set of files located on the computer.
$corpus->build_corpus_from_directory(dir=>$data_source, cleanup => 0, skipCopy => 0);
This reads all the txt files that we created above and stores them in the "produced" directory in TREC (Text REtrieval Conference) format. cleanup => 0 retains the metafiles produced during the corpus build.
Then, we build the IDF (Inverse Document Frequency) and the TF (Term Frequency).
$corpus->buildIdf(stemmed => 1);
$corpus->build_docno_dbm();
$corpus->buildTf(stemmed => 1);
The stemmed => 1 option indicates that the TF and the IDF are computed using stemmed values. Note that build_docno_dbm must be called before buildTf; it builds the DOCNO-to-URL and URL-to-DOCNO databases.
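To make the two quantities concrete, here is a minimal sketch that computes them by hand over three invented toy documents. It is not Clairlib's implementation: it skips stemming, and it assumes the textbook IDF form log(N / df), which may differ from the formula Clairlib uses. The function names are hypothetical.

```perl
use strict;
use warnings;

# Toy documents (invented sample data): doc id => text.
my %docs = (
    1 => "boundary layer flow",
    2 => "layer of shear flow flow",
    3 => "heat transfer",
);

# Term frequency: how often each term occurs in each document.
sub term_frequencies {
    my ($docs) = @_;
    my %tf;
    for my $id (keys %$docs) {
        $tf{$id} //= {};
        $tf{$id}{$_}++ for split ' ', lc $docs->{$id};
    }
    return \%tf;
}

# Inverse document frequency, assuming the textbook form log(N / df),
# where df is the number of documents that contain the term.
sub inverse_document_frequencies {
    my ($tf) = @_;
    my $n = keys %$tf;
    my %df;
    for my $id (keys %$tf) {
        $df{$_}++ for keys %{ $tf->{$id} };
    }
    my %idf = map { $_ => log($n / $df{$_}) } keys %df;
    return \%idf;
}

my $tf  = term_frequencies(\%docs);
my $idf = inverse_document_frequencies($tf);
printf "tf(flow, doc 2) = %d\n", $tf->{2}{flow};
printf "idf(flow) = %.3f, idf(heat) = %.3f\n", $idf->{flow}, $idf->{heat};
```

A term that appears in fewer documents gets a higher IDF, so in this toy data "heat" outweighs "flow".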
The complete create_index subroutine, which creates the corpus and builds the index, is:
sub create_index{
my $data_source = shift;
my $corpus = Clair::Utils::CorpusDownload->new(corpusname => "corpus", rootdir => "produced");
$corpus->build_corpus_from_directory(dir=>"$data_source", cleanup => 0, relative => 1, skipCopy => 0);
$corpus->buildIdf(stemmed => 1);
$corpus->build_docno_dbm();
$corpus->buildTf(stemmed => 1);
$corpus->build_term_counts(stemmed => 1);
}
Now, put the three subroutines (parse_file(), process_collection(), and create_index()) in a module file and name it IR.pm. The module must also load Clair::Utils::CorpusDownload, and because it is loaded with use, it must end with a true value (conventionally 1;).
Create a Perl script file index.pl and add the following to it:
#!/usr/bin/perl
use IR;
my $collection_path=shift;
my $destination="data";
process_collection($collection_path,$destination);
create_index($destination);
Queries description
We need our system to be able to handle the following types of queries:
- cat : returns any document that has the word "cat" in it
- cat dog : any document that has one or more of these words ("fuzzy or" is assumed by default)
- cat dog rat : up to 10 words in a query
- "tabby cat" : phrases of up to 5 words in length
- "small tabby cat" "shaggy dog" : multiple phrases in a query
- !cat
- !"tabby cat" : negations of single words or phrases
- !cat !dog : multiple negations per query
In this tutorial, we will not worry about nested queries in parentheses. Each query should return one or more matching documents. If multiple documents match, we should order them by decreasing score. The score is defined as the number of query terms (or phrases) that match in the "fuzzy or", including repetitions and counting phrases as the number of words that are included in them. Negated terms are not included in the score. For example, if the query is:
cat dog "pack rat"
and we have three documents D1, D2, and D3 that contain at least one of the query terms as follows:
D1: cat cat dog cat mouse
D2: pack rat cat rat rat
D3: rabbit elephant dog dog cat pack rat cat
their scores should be as follows:
D1: 4
D2: 3 ("pack rat" counts as two terms, but "rat" alone doesn't count)
D3: 6
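The scoring rule can be checked against this example with a few lines of plain Perl. The sketch below is independent of Clairlib and works on raw strings rather than on the index; count_occurrences and score_document are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Count how often a phrase (one or more words) occurs in a token list.
sub count_occurrences {
    my ($tokens, $words) = @_;
    my $count = 0;
    for my $start (0 .. @$tokens - @$words) {
        my $matched = 1;
        for my $offset (0 .. $#$words) {
            if ($tokens->[$start + $offset] ne $words->[$offset]) {
                $matched = 0;
                last;
            }
        }
        $count++ if $matched;
    }
    return $count;
}

# Score = sum over non-negated terms of (occurrences * words in the term).
sub score_document {
    my ($text, @terms) = @_;
    my @tokens = split ' ', $text;
    my $score = 0;
    for my $term (@terms) {
        next if $term =~ /^!/;    # negated terms add nothing to the score
        my @words = split ' ', $term;
        $score += count_occurrences(\@tokens, \@words) * @words;
    }
    return $score;
}

my @terms = ("cat", "dog", "pack rat");
print "D1: ", score_document("cat cat dog cat mouse", @terms), "\n";
print "D2: ", score_document("pack rat cat rat rat", @terms), "\n";
print "D3: ", score_document("rabbit elephant dog dog cat pack rat cat", @terms), "\n";
```

Running it prints scores of 4, 3, and 6 for D1, D2, and D3, matching the list above.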
Processing and handling the queries
We show the code first; the explanation follows.
#!/usr/bin/perl
use IR;
print "Enter your query or type q to quit\n>";
my $query= <>;
while ($query ne "q\n"){
my @searchTerms =();
while ($query =~ s/!\"(.+?)\"//){
push(@searchTerms, "!".$1);
}
while ($query =~ s/\"(.+?)\"//){
push(@searchTerms, $1);
}
push(@searchTerms, split(' ', $query)); # split(' ') skips leading whitespace, so no empty term is added
my ($ref, $locref) = execute_query(@searchTerms);
my %results = %$ref;
my %locations = %$locref;
@sortedResults = reverse sort {$results{$a} <=> $results{$b}} keys %results;
foreach my $result(@sortedResults){
my $sum = get_summary($result, $locations{$result});
print "Doc: $result \tScore: $results{$result}\t$sum\n";
}
print "\n>";
$query=<>;
}
Add the code above to a Perl script file and name it query.pl
The code for the get_summary() subroutine is:
sub get_summary{
my ($docId, $position) = @_;
my $text = `cat data/$docId.txt`;
$text =~ s/\s\./\./g;
my $return = "";
my $start = 0;
@words = split /\s+/,$text;
if ($position > 11){
$start = $position - 11;
}
for($count = $start; $count <= ($start+20); $count++){
if (exists $words[$count]){$return .= "$words[$count] ";}
if ($count == $start +10){$return .= "\n\t\t";}
}
return $return;
}
Add get_summary() to IR.pm
Read in the query and parse it
The code above starts by asking the user to enter a query to search for, or "q" to exit.
print "Enter your query or type q to quit\n>";
my $query= <>;
For each query, create an array to store the query terms in.
my @searchTerms=();
A search term is a single word (e.g. cat), a negated word (e.g. !cat), a phrase (e.g. "cat dog"), or a negated phrase (e.g. !"cat dog"). To parse the query into a set of search terms, we start by extracting the negated phrases from the query (e.g. !"cat dog") and adding each one as a single negated entry to the searchTerms array.
while ($query =~ s/!\"(.+?)\"//){
push(@searchTerms, "!".$1);
}
For example, if the query is
cat !dog !"dog rat" "monkey cat"
The while loop above extracts (!dog rat) from the query, adds it as one entry to the searchTerms array, and leaves (cat !dog "monkey cat") in the query.
Next, we extract the unnegated phrases from the query (e.g. "monkey cat"):
while ($query =~ s/\"(.+?)\"//){
push(@searchTerms, $1);
}
The while loop above extracts the phrase (monkey cat) and adds it as a single search term, leaving (cat !dog) in the query.
After that, we add all the remaining words (negated and unnegated) to searchTerms:
push(@searchTerms, split(' ', $query)); # split(' ') skips leading whitespace, so no empty term is added
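The three parsing steps can be collected into a single function, which makes them easy to try on the example query. This is a minimal sketch; parse_query is a hypothetical name and is not one of the subroutines in IR.pm.

```perl
use strict;
use warnings;

# Split a raw query into search terms: negated phrases first, then
# plain phrases, then the remaining single words.
sub parse_query {
    my ($query) = @_;
    my @search_terms;
    while ($query =~ s/!"(.+?)"//) {
        push @search_terms, "!$1";
    }
    while ($query =~ s/"(.+?)"//) {
        push @search_terms, $1;
    }
    # split ' ' ignores leading whitespace and the trailing newline.
    push @search_terms, split ' ', $query;
    return @search_terms;
}

print join(" | ", parse_query(qq{cat !dog !"dog rat" "monkey cat"\n})), "\n";
```

For the example query this yields four terms in extraction order: the negated phrase, the phrase, and then the two single words.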
Execute the query and return the results
After adding all the search terms to searchTerms as explained in the previous subsection, the searchTerms array is passed to the execute_query subroutine:
my ($ref, $locref) = execute_query(@searchTerms);
execute_query takes an array of search terms and returns references to two hashes: one maps the DocIds of matching documents to their scores, and the other maps each DocId to the position of its first match. The code of execute_query is:
sub execute_query{
my @searchTerms = @_;
my $tf = Clair::Utils::Tf->new(rootdir => "produced", corpusname => "corpus", stemmed => 1);
my %results = ();
my %out = ();
my %location = ();
my @negation;
my $negationOnly = 1;
foreach $term(@searchTerms){
if ($term=~ s/!//g){
push(@negation, $term);
}else{
$negationOnly = 0;
my @words = split(/ /, $term);
$numWords = @words;
my $urls = $tf->getDocsWithPhrase(@words);
foreach my $key (keys %$urls){
$ref = $urls->{$key};
$toAdd = keys(%$ref) * $numWords;
$key =~ s/.*\/([0-9]+)\.txt/$1/g;
if (!exists $location{$key}){
my ($position, $storedVal) = each %$ref; #just need the first position
$location{$key} = $position;
}
if (exists $out{$key}){
$out{$key}+= $toAdd;
}else{
$out{$key}=$toAdd;
}
}
}
}
if ($negationOnly == 1){
%out = getAllDocKeys($tf);
foreach my $key(keys %out){
$location{$key} = 0;
}
}
if ((scalar @negation) > 0){
my @negatedDocs = negationResults(\@negation, $tf);
foreach $removeDoc(@negatedDocs){
if (exists $out{$removeDoc}){
delete $out{$removeDoc};
}
}
}
return \%out, \%location;
}
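execute_query relies on Clair::Utils::Tf to find the documents that contain a word or phrase. To illustrate the underlying idea only, the following sketch builds a small positional inverted index in memory and looks up phrases in it. It is not Clairlib's implementation; build_index and docs_with_phrase are hypothetical names, and the three documents are the toy examples from the query description.

```perl
use strict;
use warnings;

# Build a positional inverted index: term => { doc id => [positions] }.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $id (keys %$docs) {
        my @tokens = split ' ', lc $docs->{$id};
        for my $pos (0 .. $#tokens) {
            push @{ $index{ $tokens[$pos] }{$id} }, $pos;
        }
    }
    return \%index;
}

# Return { doc id => [start positions] } for every occurrence of a phrase.
sub docs_with_phrase {
    my ($index, @words) = @_;
    my %hits;
    return \%hits unless @words;
    my $first = $index->{ $words[0] } or return \%hits;
    for my $id (keys %$first) {
        for my $start (@{ $first->{$id} }) {
            my $ok = 1;
            for my $offset (1 .. $#words) {
                # Each later word must appear right after the previous one.
                my $postings  = $index->{ $words[$offset] } || {};
                my $positions = $postings->{$id} || [];
                unless (grep { $_ == $start + $offset } @$positions) {
                    $ok = 0;
                    last;
                }
            }
            push @{ $hits{$id} }, $start if $ok;
        }
    }
    return \%hits;
}

my $index = build_index({
    D1 => "cat cat dog cat mouse",
    D2 => "pack rat cat rat rat",
    D3 => "rabbit elephant dog dog cat pack rat cat",
});
my $hits = docs_with_phrase($index, "pack", "rat");
print "$_: @{ $hits->{$_} }\n" for sort keys %$hits;
```

Multiplying the number of start positions by the number of words in the phrase gives the same per-term contribution that execute_query adds to a document's score.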
Sort the results on scores
my %results = %$ref;
my %locations = %$locref;
@sortedResults = reverse sort {$results{$a} <=> $results{$b}} keys %results;
Print out the results
foreach my $result(@sortedResults){
my $sum = get_summary($result, $locations{$result});
print "Doc: $result \tScore: $results{$result}\t$sum\n";
}
Test the system
Now, you have three files
- IR.pm which includes the following subroutines
- parse_file
- create_index
- process_collection
- execute_query
- get_summary
- getAllDocKeys
- negationResults
- index.pl
- query.pl
Make sure that these files are in the same directory. To test the system, first run index.pl to create the index.
./index.pl cranfield
Then, run query.pl
./query.pl
Then, enter a search query (e.g. chemical reactions):
Enter your query or type q to quit
>chemical reactions

