Generate a Collection of Synthetic Documents

From Clairlib

In this tutorial, we will use the Clairlib utility '' to generate a collection of synthetic documents. The utility takes a pre-existing corpus, stored in the 'produced' directory, as the starting point and builds a synthetic collection similar to the original. It also takes statistical parameters as arguments, which control how the synthetic collection is generated. To see the options the utility provides, run it with the --help option.

For this tutorial, we will generate a collection with the name SynthCollection from the files downloaded in the Utilities Tutorial. First, if you have already completed the Utilities Tutorial, rename the directory 'corpus' to 'source' to make things clearer:

mv corpus source

If you haven't already completed that section of the tutorial, use the first few commands in Section 6 to download the required documents into 'source' and generate and index a corpus from 'source':

mkdir source
cd source
wget -r -nd -nc
cd ..
--corpus chemical --base produced \
--directory source
--corpus chemical --base produced

Now, we will use the 'chemical' corpus in the directory 'produced' as a base to generate a synthetic collection. The collection will contain 20 documents. The terms used in the synthetic documents will be chosen from the input corpus using a Zipfian distribution (with alpha = 1) with respect to term frequency. (That is, terms ranked higher in term frequency in the original corpus are more likely to appear in the synthetic collection. The probability is inversely proportional to rank raised to the power alpha, as dictated by Zipf's Law.) The lengths of the synthetic documents will mirror the lengths of the original documents.

--output SynthCollection --directory synth_out \
--corpus chemical --base produced --size 20 --term-policy zipfian \
--term-alpha 1 --doclen-policy mirror --verbose

This will generate the synthetic collection in the directory 'synth_out/'. The --size argument specifies the number of documents in the synthetic collection. --term-policy is the method used to pick terms from the source corpus to include in the synthetic collection. --doclen-policy is the method used to determine the lengths of the documents. The other 'term' and 'doclen' arguments are statistical parameters whose meaning depends on the policies used. That is, different --term-policy and --doclen-policy values require different additional arguments (as listed in the --help output).
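To make the Zipfian term policy and the 'mirror' document-length policy more concrete, here is a minimal Python sketch of the idea. It is not Clairlib's implementation: the function names (zipfian_weights, generate_document, mirror_lengths) and the toy source documents are hypothetical, and the sketch assumes that each term is drawn independently with weight 1 / rank^alpha.

```python
import random
from collections import Counter

def zipfian_weights(num_terms, alpha=1.0):
    """Return the weight 1 / rank**alpha for ranks 1..num_terms."""
    return [1.0 / (rank ** alpha) for rank in range(1, num_terms + 1)]

def generate_document(source_tokens, length, alpha=1.0, rng=None):
    """Draw `length` terms; more frequent source terms get larger weights."""
    rng = rng or random.Random()
    # Rank the unique terms from most to least frequent in the source.
    ranked_terms = [term for term, _ in Counter(source_tokens).most_common()]
    weights = zipfian_weights(len(ranked_terms), alpha)
    return rng.choices(ranked_terms, weights=weights, k=length)

def mirror_lengths(source_docs):
    """'Mirror' policy (assumed): reuse the lengths of the source documents."""
    return [len(doc) for doc in source_docs]

# Toy source corpus (hypothetical), already tokenized.
source_docs = [
    "the acid reacts with the base".split(),
    "the base neutralizes the acid in the solution".split(),
]
all_tokens = [token for doc in source_docs for token in doc]

rng = random.Random(0)  # fixed seed so repeated runs give the same draw
synthetic = [
    generate_document(all_tokens, length, alpha=1.0, rng=rng)
    for length in mirror_lengths(source_docs)
]
for doc in synthetic:
    print(" ".join(doc))
```

With alpha = 1, the top-ranked term is twice as likely to be drawn as the second-ranked term and three times as likely as the third. Larger values of alpha concentrate the draws even more heavily on the most frequent terms.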

One special term-policy is 'manualweights', which requires the user to provide a file containing a list of weights corresponding to all unique terms in the source corpus, sorted from most to least frequent. One way to determine the rank of each term, and then adjust its weight, is to look at the file 'source_tc.txt', which is generated in the output directory (in this tutorial, 'synth_out').

Another interesting feature of the utility is its ability to generate random documents based not only on a distribution of terms and their frequencies, but also on n-grams. Rather than simply taking into account the term count data of the source corpus, the utility can use the CMU-LM toolkit to extract n-grams from a corpus of files, then generate a collection of documents based on n-gram frequencies. Clairlib currently supports 2-, 3-, and 4-grams (as well as 1-grams, as demonstrated earlier in this section).

For this section of the tutorial, we will first create a corpus out of a file included with Clairlib, 11sent.txt. The following set of commands is reused in Section 7.13 of the tutorial:

cp $CLAIRLIB-HOME/corpora/11sent/11sent.txt ./
--input 11sent.txt --output 11sent_source
--corpus 11Sent --base 11sent_produced \
--directory 11sent_source
--corpus 11Sent --base 11sent_produced

We now have a corpus based on 11sent.txt (called 11Sent) in the directory 11sent_produced/. We can use the utility to generate 11 documents similar to the 11Sent corpus, based on 3-grams extracted from the source:

--output NgramCorpus --directory trigram_synth_out \
--corpus 11Sent --base 11sent_produced --size 11 --ngram 3 --filetype text \
--doclen-policy mirror --verbose

The collection of documents found in the directory trigram_synth_out/ is composed entirely of 3-grams extracted from 11sent.txt. Notice that the command we used to generate this collection doesn't include a '--term-policy' argument. Clairlib uses RandomDistributionFromWeights in the finite state machine to decide on the next n-gram to use. Because this process is different from the one used during document generation using 1-grams, a '--term-policy' argument is not needed.
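The n-gram generation described above can be pictured as a walk through a state machine in which each state is the last two tokens and the next token is drawn with probability proportional to how often it followed that state in the source. The Python sketch below illustrates that idea only. It does not use Clairlib's RandomDistributionFromWeights or the CMU-LM toolkit, and the function names, the toy token list, and the restart-on-dead-end rule are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def build_trigram_model(tokens):
    """Map each two-token state to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        model[(first, second)][third] += 1
    return model

def generate(model, length, rng=None):
    """Walk the trigram state machine; needs a source with at least 3 tokens."""
    rng = rng or random.Random()
    states = sorted(model)          # sorted so a fixed seed is reproducible
    state = rng.choice(states)
    output = list(state)
    while len(output) < length:
        followers = model.get(state)
        if not followers:
            # Dead end (assumed handling): restart from a random known state.
            state = rng.choice(states)
            output.extend(state)
            continue
        words = list(followers)
        counts = list(followers.values())
        # Weighted random choice: frequent continuations are more likely.
        next_word = rng.choices(words, weights=counts, k=1)[0]
        output.append(next_word)
        state = (state[1], next_word)
    return output[:length]

# Toy token stream (hypothetical) standing in for the source corpus.
tokens = "a b c a b d a b c".split()
model = build_trigram_model(tokens)
print(" ".join(generate(model, 10, rng=random.Random(1))))
```

In this toy model the state ('a', 'b') was followed by 'c' twice and by 'd' once, so 'c' is chosen about two thirds of the time from that state. Because every choice is driven by observed n-gram counts, no separate term-selection policy is involved.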

Note that when we generated a synthetic collection using 1-grams (i.e., terms), we did not supply an '--ngram' argument. This is because the utility defaults to unigram-based document generation when that argument is not given. Also note that a '--filetype' argument is required whenever '--ngram' is greater than 1. The acceptable values for this argument are {text, html, stem}, in compliance with Clair::Document.
