Create a Linked Corpus Out of a Pre-generated Synthetic Collection
From CLAIRlib
In this tutorial, we will create a linked corpus from a pre-generated synthetic collection using the utility 'link_synthetic_collection.pl'. First, we need a synthetic collection to pass to this utility. We will use 'make_synth_collection.pl' to create a collection of documents called 'SynthCollection' in the directory 'synth_out', as demonstrated earlier in this tutorial:
mkdir source
cd source
wget -r -nd -nc http://belobog.si.umich.edu/clair/corpora/chemical
cd ..
directory_to_corpus.pl --corpus chemical --base produced \
    --directory source
index_corpus.pl --corpus chemical --base produced
make_synth_collection.pl --output SynthCollection --directory synth_out \
    --corpus chemical --base produced --size 20 --term-policy zipfian \
    --term-alpha 1 --doclen-policy mirror --verbose
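The flags '--term-policy zipfian' and '--term-alpha 1' suggest that terms in the synthetic documents are drawn from a Zipfian distribution, in which the term of rank r is chosen with probability proportional to 1 / r^alpha. The Python sketch below illustrates that idea only; it is not Clairlib's code, and the function names, the toy vocabulary, and the assumption that terms are ranked by frequency are all hypothetical.

```python
import random


def zipf_weights(vocab_size, alpha=1.0):
    """Normalized Zipfian weights for ranks 1..vocab_size.

    The weight of rank r is proportional to 1 / r**alpha.
    """
    raw = [1.0 / (rank ** alpha) for rank in range(1, vocab_size + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_terms(vocab, length, alpha=1.0, seed=None):
    """Draw `length` terms from `vocab` (assumed ordered by rank,
    most frequent first) according to Zipfian weights."""
    rng = random.Random(seed)
    weights = zipf_weights(len(vocab), alpha)
    return rng.choices(vocab, weights=weights, k=length)


if __name__ == "__main__":
    # Hypothetical vocabulary, not taken from the 'chemical' corpus.
    vocab = ["acid", "bond", "carbon", "oxygen", "ion"]
    print(sample_terms(vocab, 10, alpha=1.0, seed=0))
```

With alpha set to 1, the top-ranked term is twice as likely as the second and three times as likely as the third; larger values of alpha concentrate more of the probability on the top ranks.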
Now, we can use 'link_synthetic_collection.pl' to link this collection of documents and create a corpus. 'link_synthetic_collection.pl' provides several link policies for generating the synthetic corpus. Each policy requires its own arguments, which are listed by the command:
link_synthetic_collection.pl --help
For this tutorial, we will use the Watts-Strogatz option, which requires the arguments -p (link probability) and -k (number of neighbors per node).
link_synthetic_collection.pl -n SynthCorpus -b synth_corpus -c SynthCollection -d synth_out -l watts -p 0.42 -k 3
This command will look in the directory 'synth_out/' for a collection of documents called 'SynthCollection'. Then, it will create a directory called 'synth_corpus/' and link 'SynthCollection' to create a corpus called 'SynthCorpus'.
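To make the roles of -p and -k more concrete, here is a minimal Python sketch of the standard Watts-Strogatz construction: nodes are placed on a ring, each node is linked to its k nearest neighbors, and each of those links is then rewired to a random node with probability p. This is an illustration of the general model, not Clairlib's implementation. The function name is hypothetical, and the sketch assumes an even k (k/2 neighbors on each side); how 'link_synthetic_collection.pl' treats an odd value such as -k 3 is not covered here.

```python
import random


def watts_strogatz_edges(n, k, p, seed=None):
    """Return a set of undirected edges (u, v), u < v, over nodes 0..n-1.

    Sketch of the standard model; assumes k is even and 0 < k < n.
    """
    if k % 2 != 0 or not 0 < k < n:
        raise ValueError("this sketch assumes an even k with 0 < k < n")
    rng = random.Random(seed)

    def edge(a, b):
        # Store undirected edges in a canonical (small, large) order.
        return (min(a, b), max(a, b))

    # Step 1: ring lattice, each node linked to k/2 neighbors per side.
    edges = set()
    for u in range(n):
        for offset in range(1, k // 2 + 1):
            edges.add(edge(u, (u + offset) % n))

    # Step 2: rewire each lattice edge with probability p, avoiding
    # self-loops and duplicate edges.
    for u in range(n):
        for offset in range(1, k // 2 + 1):
            if rng.random() >= p:
                continue
            candidates = [w for w in range(n)
                          if w != u and edge(u, w) not in edges]
            if not candidates:
                continue
            edges.remove(edge(u, (u + offset) % n))
            edges.add(edge(u, rng.choice(candidates)))
    return edges


if __name__ == "__main__":
    # 20 nodes to match --size 20 above; k=4 because this sketch needs an even k.
    links = watts_strogatz_edges(20, 4, 0.42, seed=1)
    print(len(links), "links")
```

With p = 0 the result is the regular ring lattice; as p approaches 1, the links become essentially random. Rewiring never changes the total number of links, which stays at n * k / 2.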

