Create a Linked Corpus Out of a Pre-generated Synthetic Collection


In this tutorial, we will create a linked corpus out of a pre-generated synthetic collection using the utility ''. First, we need to generate a synthetic collection to pass to this utility. We will use '' to create a collection of documents called 'SynthCollection' in the directory 'synth_out', as demonstrated earlier in this tutorial:

mkdir source
cd source
wget -r -nd -nc
cd ..
'' --corpus chemical --base produced \
    --directory source
'' --corpus chemical --base produced
'' --output SynthCollection --directory synth_out \
    --corpus chemical --base produced --size 20 --term-policy zipfian \
    --term-alpha 1 --doclen-policy mirror --verbose

Now, we can use '' to link this collection of documents and create a corpus. '' provides a number of policies to use when generating the synthetic corpus. Each link policy requires various arguments, as explained by the command:

'' --help

For this tutorial, we will use the Watts-Strogatz option, which requires the arguments -p (link probability) and -k (number of neighbors per node):

'' -n SynthCorpus -b synth_corpus -c SynthCollection -d synth_out -l watts -p 0.42 -k 3

This command will look in the directory 'synth_out/' for a collection of documents called 'SynthCollection'. Then, it will create a directory called 'synth_corpus/' and link 'SynthCollection' to create a corpus called 'SynthCorpus'.
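
To give a feel for what the -p and -k arguments control, here is a minimal Python sketch of the textbook Watts-Strogatz construction: documents are placed on a ring, each is linked to its k nearest neighbors, and each link is then rewired to a random target with probability p. This is not Clairlib's implementation. The function name `watts_strogatz`, the `seed` parameter, and the restriction to even k are assumptions made for this illustration, and how the utility itself interprets an odd value such as -k 3 is not covered here.

```python
import random


def watts_strogatz(n, k, p, seed=None):
    """Return a set of undirected links (i, j) with i < j over nodes 0..n-1.

    Illustrative sketch only; this is not how Clairlib builds its corpus.
    """
    if k % 2 or not 0 < k < n:
        raise ValueError("this sketch expects an even k with 0 < k < n")
    rng = random.Random(seed)

    # Step 1: ring lattice, each node linked to k/2 neighbors on each side.
    links = set()
    for i in range(n):
        for offset in range(1, k // 2 + 1):
            j = (i + offset) % n
            links.add((min(i, j), max(i, j)))

    # Step 2: visit every lattice link once and rewire it with probability p.
    for offset in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + offset) % n
            old = (min(i, j), max(i, j))
            if old not in links or rng.random() >= p:
                continue
            # New targets must avoid self-links and duplicate links.
            candidates = [
                t for t in range(n)
                if t != i and (min(i, t), max(i, t)) not in links
            ]
            if candidates:
                t = rng.choice(candidates)
                links.remove(old)
                links.add((min(i, t), max(i, t)))
    return links


# 20 nodes to mirror the --size 20 collection above; k=4 because the sketch needs an even k.
links = watts_strogatz(20, 4, 0.42, seed=1)
print(len(links))  # rewiring moves links but never adds or removes them: 20 * 4 / 2 = 40
```

With p = 0 the result is the regular ring lattice, and as p approaches 1 the links become essentially random; intermediate values such as 0.42 produce a mix of local clusters and long-range shortcuts.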
