Utilities Tutorial


A tutorial explaining how to use the Clairlib library and tools to create a network from a group of files and process that network to extract information.

Introduction

This tutorial will walk you through downloading files, creating a corpus from them, creating a network from the corpus, and extracting information along the way. We'll be using utilities included in the Clairlib package to do the work.

The best way to use this document is to read it all the way through, as each command is explained in turn. For convenience, all of the commands are collected in the Code section at the end of the tutorial.

Generating the corpus

The first thing we'll need is a corpus of files to work with. As an example, we'll use a set of files extracted from Wikipedia. We'll first create a folder to download those files into:

mkdir corpus

We'll use the 'wget' command to download the files. The -r option means recursively get all of the files in the directory, -nd means don't recreate the server's directory structure locally, and -nc ('no clobber') means skip any files that have already been downloaded:

cd corpus
wget -r -nd -nc http://clair.si.umich.edu/clair/corpora/chemical
cd ..
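
Before moving on, it's worth a quick sanity check that the download worked. A simple count of the files in the folder will do (the exact number depends on what the server directory contains):

ls corpus | wc -l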

Now that we have our files, we can create the corpus. To do this we'll use the 'directory_to_corpus.pl' utility. The options used here are fairly consistent across the Clairlib utilities:

--corpus, or -c, is the name of the corpus we are creating. Keep it short and simple, since we'll use it often and it appears in the names of several files we'll be creating. In this case, we call our corpus 'chemical'.
--base, or -b, is the base directory where the corpus' data files will be stored. A common practice is to use 'produced'.
--directory, or -d, is the directory containing the files to be converted:

directory_to_corpus.pl --corpus chemical --base produced \
 --directory corpus
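
If you're curious what this created, standard tools can show the layout under the base directory. The exact subdirectory names are internal to Clairlib, so treat this as exploration rather than part of the pipeline:

find produced -type d | head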

Now that our corpus has been organized, we'll index it so we can start extracting data from it. To do that we'll use 'index_corpus.pl'. Again, we'll specify the corpus name and the base directory where the index files should be produced:

index_corpus.pl --corpus chemical --base produced
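
Indexing writes its files under the base directory as well. A quick listing confirms they were produced (the file names are Clairlib internals; this is just a spot check):

ls -R produced | head -20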

We've now got our corpus and our indices and are ready to extract data.


TFs and IDFs

First we'll run a query for the term frequency of a single term. To do this we'll use 'tf_query.pl'. Let's query 'health':

tf_query.pl -c chemical -b produced -q health

This outputs a list of the files in our corpus that contain the term 'health', along with the number of times the term occurs in each file. To get term frequencies for all terms in the corpus, pass the --all option:

tf_query.pl -c chemical -b produced --all

This returns a list of terms, their frequencies, and the number of documents each occurs in.
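
Since the output is plain text with one term per line, we can combine it with standard Unix tools. For example, assuming the term comes first and its frequency second (check a few lines of the real output to confirm the column order), this shows the ten most frequent terms:

tf_query.pl -c chemical -b produced --all | sort -k2,2 -rn | head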

To see the full list of term frequencies for stemmed terms, pass the --stemmed option as well:

tf_query.pl -c chemical -b produced --stemmed --all
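
Stemming collapses variants such as 'chemical' and 'chemicals' into a single entry, so the stemmed list is usually shorter. Comparing the line counts of the two outputs shows how much the vocabulary shrinks:

tf_query.pl -c chemical -b produced --all | wc -l
tf_query.pl -c chemical -b produced --stemmed --all | wc -l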

Next we'll run a query for the inverse document frequency of a single term. To do this we'll use 'idf_query.pl'. Again, we'll query 'health':

idf_query.pl -c chemical -b produced -q health

We can also pass the --all option to idf_query.pl to get a list of idf values for all terms in the corpus:

idf_query.pl -c chemical -b produced --all
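
Rare terms get high idf values and common terms get low ones, so sorting the output numerically brings the most common terms to the top. As before, this assumes a 'term value' column layout; confirm against a few lines of real output first:

idf_query.pl -c chemical -b produced --all | sort -k2,2 -n | head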

Creating a Network

We now have a corpus from which we can extract some data. Next we'll create a network from this corpus using 'corpus_to_network.pl'. This command builds a network of the hyperlinks in our corpus and produces a graph file in which each line contains a pair of linked nodes. It requires an output file, which we'll call 'chemical.graph':

corpus_to_network.pl -c chemical -b produced -o chemical.graph
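
Since each line of the graph file holds one pair of linked nodes, we can peek at the first few edges and count the total number of edges with standard tools:

head chemical.graph
wc -l chemical.graph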

Now we can gather some data on this network by running 'print_network_stats.pl' on our graph file. This command can produce many different types of data; the easiest way to use it is with the --all option, which runs all of its analyses. We'll redirect its output to a file:

print_network_stats.pl -i chemical.graph --all > chemical.graph.stats

If we now look at 'chemical.graph.stats' we can see statistics for our network, including the numbers of nodes and edges, degree statistics, clustering coefficients, and path statistics. The command also creates three centrality files (betweenness, closeness, and degree), each listing all of the nodes and their centrality scores.
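
Because the stats file is plain text, individual figures can be pulled out with grep. For example, to find the node and edge counts (adjust the pattern to match the labels you actually see in the file):

grep -i -E 'nodes|edges' chemical.graph.stats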

Conclusions

With the tools described above, you should be able to create a corpus from a set of files and extract statistics from that corpus. For additional functionality, or for more information on the utilities used here, see the documentation at

http://clair.si.umich.edu/mediawiki/index.php/Documentation

Code

For reference, here is the full pipeline of commands from this tutorial (the optional inspection commands along the way are omitted):

mkdir corpus
cd corpus
wget -r -nd -nc http://clair.si.umich.edu/clair/corpora/chemical
cd ..
directory_to_corpus.pl --corpus chemical --base produced \
 --directory corpus
index_corpus.pl --corpus chemical --base produced
tf_query.pl -c chemical -b produced -q health
tf_query.pl -c chemical -b produced --all
idf_query.pl -c chemical -b produced -q health
idf_query.pl -c chemical -b produced --all
corpus_to_network.pl -c chemical -b produced -o chemical.graph
print_network_stats.pl -i chemical.graph --all > chemical.graph.stats