Quick Start
In this tutorial we use Clairlib utilities to create corpora, generate networks, extract plots and statistics, and perform other useful tasks. The chapter is organized into the following sections: Generating Corpora, Gathering Corpus Statistics, Generating Networks, Gathering Network Statistics, and Other Useful Tools.
Generating Corpora
Generate a corpus by downloading files
output: indexed corpus

    mkdir corpus
    cd corpus
    wget -r -nd -nc \
        http://belobog.si.umich.edu/clair/corpora/chemical
    cd ..
    directory_to_corpus.pl -c chemical -b produced -d corpus
    index_corpus.pl -c chemical -b produced
Generate a corpus by crawling a site
output: indexed corpus

    crawl_url.pl -u http://www.asdg.com/ -o asdg.urls
    download_urls.pl -c asdg -i asdg.urls -b produced
    index_corpus.pl -c asdg -b produced
Generate a corpus from a Google search
output: indexed corpus

    search_to_url.pl -q bulgaria -n 10 > bulgaria.10.urls
    download_urls.pl -i bulgaria.10.urls -c bulgaria-10 -b produced
    index_corpus.pl -c bulgaria-10 -b produced
Generate a corpus of sentences from a document
input: collection of documents
output: indexed corpus

    sentences_to_docs.pl -d $CLAIRLIB/corpora/1984/ -o docs
    directory_to_corpus.pl -c 1984sents -b produced -d docs
    index_corpus.pl -c 1984sents -b produced
Generate a corpus using a Zipfian distribution
input: indexed corpus
output: synthetic corpus

    make_synth_collection.pl --policy zipfian --alpha 1 -o synth -d synth_out \
        -c chemical -b produced --size 11 --verbose
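For intuition about the --alpha parameter: under the usual definition of a Zipfian distribution, the item at rank r gets a weight proportional to 1/r^alpha. The sketch below prints the normalized weights for the first few ranks. It is independent of Clairlib, the helper name zipf_weights is hypothetical, and it assumes the zipfian policy follows this standard definition; how make_synth_collection.pl applies it internally is not shown here.

```shell
# zipf_weights N ALPHA: print "rank weight" for ranks 1..N, where the weight
# is 1/rank^ALPHA normalized so that all N weights sum to 1.
# Hypothetical helper for illustration only; not part of Clairlib.
zipf_weights() {
    awk -v n="$1" -v alpha="$2" 'BEGIN {
        for (r = 1; r <= n; r++) total += 1 / (r ^ alpha)
        for (r = 1; r <= n; r++) printf "%d %.4f\n", r, (1 / (r ^ alpha)) / total
    }'
}

zipf_weights 4 1
```

With alpha set to 1 and four ranks, the top-ranked item receives 0.48 of the total weight and the fourth receives 0.12. Larger alpha values concentrate more weight on the top ranks.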
Gathering Corpus Statistics
Run IDF queries on a corpus
input: indexed corpus
output: idf query data

    idf_query.pl -c chemical -b produced -q health
    idf_query.pl -c chemical -b produced --all
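As a rough guide to the numbers an IDF query reports, the sketch below computes the textbook inverse document frequency, log(N / df), over a small directory of plain-text files. It is a minimal illustration under stated assumptions: the helper name idf is hypothetical, it uses the natural logarithm with no smoothing, and Clairlib's own formula, tokenization, and stemming may differ.

```shell
# idf TERM DIR: print log(N / df) using the natural log, where N is the number
# of files under DIR and df is how many of them contain TERM as a whole word
# (case-insensitive). Hypothetical helper; Clairlib's formula may differ.
idf() {
    term=$1 dir=$2
    n=$(find "$dir" -type f | wc -l)
    df=$(grep -rliw -- "$term" "$dir" | wc -l)
    awk -v n="$n" -v df="$df" 'BEGIN {
        if (df == 0) { print "term not found"; exit 1 }
        printf "%.4f\n", log(n / df)
    }'
}

# Toy collection: "health" occurs in 2 of 4 documents.
demo=$(mktemp -d)
printf 'health effects of lead\n' > "$demo/a.txt"
printf 'atomic number of lead\n'  > "$demo/b.txt"
printf 'public health data\n'     > "$demo/c.txt"
printf 'noble gases\n'            > "$demo/d.txt"
idf health "$demo"
```

A term that appears in fewer documents scores higher: in the toy collection, "gases" (one document in four) has twice the IDF of "health" (two in four).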
Run TF queries on a corpus
input: indexed corpus
output: tf query data

    tf_query.pl -c chemical -b produced -q health
    tf_query.pl -c chemical -b produced --all
    tf_query.pl -c chemical -b produced --stemmed --all
    tf_query.pl -c chemical -b produced -q "atomic number"
Generating Networks
Generate a network from a corpus
input: indexed corpus
output: network graph

    corpus_to_network.pl -c chemical -b produced -o chemical.graph
Generate a synthetic network using the Erdos/Renyi linking model
output: synthetic graph

    # With n nodes and m edges
    generate_random_network.pl -o synthetic.graph -t erdos-renyi-gnm -n 100 -m 88
    # With n nodes, where each possible edge is included with probability p
    generate_random_network.pl -o synthetic.graph \
        -t erdos-renyi-gnp -n 100 -p .1
    # Based on another graph
    generate_random_network.pl -o synthetic.graph -i $CLAIRLIB/corpora/david_copperfield/adjnoun.graph \
        -t erdos-renyi-gnp -p .1
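The two models above produce graphs of very different density with the parameters shown. As a quick check, the sketch below computes the expected edge count of a G(n, p) graph, p * n * (n - 1) / 2, and the value of p that would give a chosen number of edges on average. It assumes an undirected graph without self-loops; whether generate_random_network.pl makes the same assumption is not stated here, and the helper names are hypothetical.

```shell
# expected_edges N P: expected edge count of an undirected G(n, p) graph
# without self-loops, i.e. P * N * (N - 1) / 2.
# Hypothetical helper for illustration only.
expected_edges() {
    awk -v n="$1" -v p="$2" 'BEGIN { printf "%.1f\n", p * n * (n - 1) / 2 }'
}

# matching_p N M: edge probability at which G(n, p) has M edges on average.
matching_p() {
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.4f\n", m / (n * (n - 1) / 2) }'
}

expected_edges 100 .1
matching_p 100 88
```

Under these assumptions, -n 100 -p .1 yields about 495 edges on average, far more than the 88 edges requested in the G(n, m) example. A probability near 0.0178 would give a comparable density.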
Gathering Network Statistics
Generate plots and statistics from a corpus
input: indexed corpus
output: plots and stats

    corpus_to_cos.pl -c chemical -o chemical.cos -b produced
    cos_to_cosplots.pl -i chemical.cos
    cos_to_histograms.pl -i chemical.cos
    cos_to_stats.pl -i chemical.cos -o chemical.stats
Generate plots from a network
input: network file
output: degree distribution plots

    network_to_plots.pl -i chemical.cos --bins 100
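To see what a degree distribution contains before plotting it, the sketch below tallies node degrees from an edge list and prints how many nodes have each degree. It assumes a whitespace-separated file with one edge per line, treats edges as undirected, and ignores any extra columns such as weights. The helper name degree_histogram is hypothetical, and the sketch does not reproduce the binning that network_to_plots.pl performs.

```shell
# degree_histogram: read an undirected edge list ("node1 node2" per line) on
# stdin and print "degree number_of_nodes", sorted by degree.
# Hypothetical helper for illustration only; extra columns are ignored.
degree_histogram() {
    awk 'NF >= 2 { deg[$1]++; deg[$2]++ }
         END { for (v in deg) hist[deg[v]]++
               for (d in hist) print d, hist[d] }' |
    sort -n
}

# Toy graph: a has degree 3, b and c have degree 2, d has degree 1.
printf 'a b\na c\na d\nb c\n' | degree_histogram
```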
Other Useful Tools
Selecting a subset of a corpus for processing
input: existing corpus
output: directory containing subset of corpus

    corpus_to_cluster.pl -c bulgaria-10 -b produced \
        -f '^https://www.cia.gov/' \
        -f '^http://en.wikipedia.org/' -o filtered
    directory_to_corpus.pl -c bulgaria-filtered -b produced -d filtered
Convert a network from one format to another
input: gml file (or pajek file)
output: edgelist file

    convert_network.pl -v \
        -input $CLAIRLIB/corpora/david_copperfield/adjnoun.gml \
        --input-format gml --output ./adjnoun.graph \
        --output-format edgelist
    print_network_stats.pl -i ./adjnoun.graph --undirected
Extract ngrams from document and create network
input: document
output: stats

    extract_ngrams.pl -r "$CLAIRLIB/corpora/1984/1984.txt" \
        -f text -w 1984.2gram -N 2 -sort -v
    print_network_stats.pl -i 1984.2gram -v --all --sample 100 \
        --sample-type forestfire > 1984.2gram.stats
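The idea behind the bigram network is that each pair of adjacent words becomes an edge. The sketch below counts such pairs with standard Unix tools. It is a minimal illustration, not a reimplementation of extract_ngrams.pl: the helper name bigrams is hypothetical, tokenization is a simple split on non-letters, sentence boundaries are ignored, and the output format is an assumption.

```shell
# bigrams: read text on stdin, split it into lowercase words on non-letters,
# and print "count word1 word2" for each adjacent pair, most frequent first.
# Hypothetical helper for illustration only; not part of Clairlib.
bigrams() {
    tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' |
    awk 'NF { if (prev != "") count[prev " " $0]++; prev = $0 }
         END { for (b in count) print count[b], b }' |
    sort -k1,1nr -k2
}

printf 'The cat saw the cat.\n' | bigrams
```

In the toy sentence, "the cat" occurs twice and is listed first, followed by "cat saw" and "saw the" with one occurrence each.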
Generate statistics for word growth model from a corpus
input: indexed corpus
output: stats
required: Matlab

    network_growth.pl -c chemical -b produced
    stats2matlab.pl -i chemical.wordmodel.stats -o wordmodel.m
    matlab -nojvm -nosplash < wordmodel.m

