Create a Linked Corpus Out of a Pre-generated Synthetic Collection


In this tutorial, we will create a linked corpus out of a pre-generated synthetic collection using the utility 'link_synthetic_collection.pl'. First, we need a synthetic collection to pass to this utility. As demonstrated earlier in this tutorial, the commands below download the 'chemical' source documents, build and index a corpus named 'chemical' under 'produced', and then use 'make_synth_collection.pl' to create a collection of documents called 'SynthCollection' in the directory 'synth_out':

mkdir source
cd source
wget -r -nd -nc http://belobog.si.umich.edu/clair/corpora/chemical
cd ..
directory_to_corpus.pl --corpus chemical --base produced \
--directory source
index_corpus.pl --corpus chemical --base produced
make_synth_collection.pl --output SynthCollection --directory synth_out \
--corpus chemical --base produced --size 20 --term-policy zipfian \
--term-alpha 1 --doclen-policy mirror --verbose

Now we can use 'link_synthetic_collection.pl' to link this collection of documents and create a corpus. The utility provides a number of link policies for generating the synthetic corpus. Each link policy requires its own arguments, which are listed by the command:

link_synthetic_collection.pl --help

For this tutorial, we will use the Watts-Strogatz policy ('-l watts'), which requires the arguments -p (link probability) and -k (number of neighbors per node):

link_synthetic_collection.pl -n SynthCorpus -b synth_corpus -c SynthCollection -d synth_out -l watts -p 0.42 -k 3
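
To give a feel for what the two arguments control, here is a minimal sketch of the textbook Watts-Strogatz construction: nodes are placed on a ring, each is linked to its nearest neighbors, and each link is then rewired to a random node with probability p. This is not Clairlib's code. The function name 'ws_edges', its arguments, and the seed are hypothetical, and whether the utility interprets -p and -k exactly this way is an assumption.

```shell
#!/bin/sh
# Textbook Watts-Strogatz sketch, not the Clairlib implementation.
# ws_edges N K P [SEED] prints one "source target" pair per line.
# Assumes K is smaller than N.
ws_edges() {
    awk -v n="$1" -v k="$2" -v p="$3" -v seed="${4:-1}" '
    function linked(a, b) { return ((a, b) in edge) || ((b, a) in edge) }
    BEGIN {
        srand(seed)
        half = int(k / 2)   # neighbors on each side; an odd k rounds down here

        # 1. Ring lattice: link each node to the next "half" nodes.
        for (i = 0; i < n; i++)
            for (j = 1; j <= half; j++)
                edge[i, (i + j) % n] = 1

        # 2. Rewire each lattice edge with probability p.
        for (i = 0; i < n; i++)
            for (j = 1; j <= half; j++) {
                if (rand() >= p) continue
                u = int(rand() * n)                   # candidate new target
                if (u == i || linked(i, u)) continue  # keep the old edge
                delete edge[i, (i + j) % n]
                edge[i, u] = 1
            }

        # 3. Print the edges in a stable order.
        for (i = 0; i < n; i++)
            for (u = 0; u < n; u++)
                if ((i, u) in edge) print i, u
    }'
}

# Example: 20 nodes with the -k and -p values used in this tutorial.
ws_edges 20 3 0.42
```

With p set to 0 the sketch prints the plain ring lattice. Raising p replaces more of the local links with random long-range ones, while the total number of links stays the same.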

This command looks in the directory 'synth_out/' for a collection of documents called 'SynthCollection'. It then creates a directory called 'synth_corpus/' and links the documents in 'SynthCollection' to produce a corpus called 'SynthCorpus'.
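
If you plan to rerun the linking step with different names or parameters, it can help to keep them in one place. The wrapper below is a minimal sketch that uses only the flags shown above. The variable names, the 'link_corpus' function, and the DRY_RUN switch are illustrative and are not part of Clairlib.

```shell
#!/bin/sh
# Sketch only: variable names and the DRY_RUN switch are illustrative,
# not part of Clairlib.
set -eu

COLLECTION="SynthCollection"   # collection name (-c)
COLLECTION_DIR="synth_out"     # directory holding the collection (-d)
CORPUS="SynthCorpus"           # corpus to create (-n)
CORPUS_DIR="synth_corpus"      # directory for the new corpus (-b)
LINK_PROB="0.42"               # Watts-Strogatz link probability (-p)
NEIGHBORS="3"                  # neighbors per node (-k)

link_corpus() {
    # Build the same command line as in the tutorial text.
    set -- link_synthetic_collection.pl \
        -n "$CORPUS" -b "$CORPUS_DIR" \
        -c "$COLLECTION" -d "$COLLECTION_DIR" \
        -l watts -p "$LINK_PROB" -k "$NEIGHBORS"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (the default here): print the command instead of running it.
        echo "$@"
    else
        # Fail early if the collection directory is missing.
        [ -d "$COLLECTION_DIR" ] || { echo "missing directory: $COLLECTION_DIR" >&2; return 1; }
        "$@"
    fi
}

link_corpus
```

By default the script only prints the command it would run. Setting DRY_RUN=0 in the environment makes it check that the collection directory exists and then invoke the utility.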
