Quick Start

This tutorial uses Clairlib utilities to create corpora, generate networks, extract plots and statistics, and perform other useful tasks. It is organized into five sections: Generating Corpora, Gathering Corpus Statistics, Generating Networks, Gathering Network Statistics, and Other Useful Tools.


Generating Corpora

Generate a corpus by downloading files

output: indexed corpus
mkdir corpus
cd corpus
wget -r -nd -nc \
cd ..
directory_to_corpus.pl -c chemical -b produced -d corpus
index_corpus.pl -c chemical -b produced

Generate a corpus by crawling a site

output: indexed corpus
crawl_url.pl -u http://www.asdg.com/ -o asdg.urls
download_urls.pl -c asdg -i asdg.urls -b produced
index_corpus.pl -c asdg -b produced

Generate a corpus from a Google search

output: indexed corpus
search_to_url.pl -q bulgaria -n 10 > bulgaria.10.urls
download_urls.pl -i bulgaria.10.urls -c bulgaria-10 -b produced
index_corpus.pl -c bulgaria-10 -b produced

Generate a corpus of sentences from a document

input: collection of documents
output: indexed corpus
sentences_to_docs.pl -d $CLAIRLIB/corpora/1984/ -o docs
directory_to_corpus.pl -c 1984sents -b produced -d docs
index_corpus.pl -c 1984sents -b produced
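
To make the idea of a sentence-level corpus concrete, here is a minimal Python sketch of splitting a document into sentences so that each one can be treated as its own document. It is not how sentences_to_docs.pl works internally; the function name `split_sentences` and the naive punctuation rule are assumptions made for illustration only.

```python
import re

def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace.

    Hypothetical helper: a naive rule that will mis-split abbreviations
    such as "Mr. Smith". A real sentence splitter is more careful.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

if __name__ == "__main__":
    sample = "It was a bright cold day. The clocks were striking thirteen!"
    for number, sentence in enumerate(split_sentences(sample), start=1):
        print(number, sentence)
```

Each returned sentence would then be written to its own file before running directory_to_corpus.pl on the output directory.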

Generate a corpus using a Zipfian distribution

input: indexed corpus
output: synthetic corpus
make_synth_collection.pl --policy zipfian --alpha 1 -o synth -d synth_out -c chemical -b produced --size 11 --verbose
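
As background for the `--policy zipfian --alpha 1` options, the following Python sketch shows what a Zipfian distribution looks like: the item at rank r receives a weight proportional to 1/r^alpha. It illustrates the distribution only and makes no claim about how make_synth_collection.pl applies it; the function names and the toy vocabulary are hypothetical.

```python
import random

def zipfian_weights(n_items, alpha=1.0):
    """Return normalized weights proportional to 1 / rank**alpha."""
    raw = [1.0 / (rank ** alpha) for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [weight / total for weight in raw]

def sample_items(items, size, alpha=1.0, seed=0):
    """Draw `size` items with replacement; earlier items are more likely."""
    rng = random.Random(seed)
    weights = zipfian_weights(len(items), alpha)
    return rng.choices(items, weights=weights, k=size)

if __name__ == "__main__":
    vocabulary = ["the", "of", "acid", "oxide", "isotope"]  # toy ranked list
    print(zipfian_weights(len(vocabulary)))
    print(sample_items(vocabulary, size=11))
```

With alpha set to 1, the top-ranked item is twice as likely as the second and three times as likely as the third.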

Gathering Corpus Statistics

Run IDF queries on a corpus

input: indexed corpus
output: idf query data
idf_query.pl -c chemical -b produced -q health
idf_query.pl -c chemical -b produced --all
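
For readers unfamiliar with IDF, the sketch below computes an inverse document frequency in Python. It assumes the common definition log(N / df), where N is the number of documents and df is the number of documents containing the term; the exact formula used by idf_query.pl is not given here and may differ. The function name and the toy documents are hypothetical.

```python
import math

def inverse_document_frequency(documents, term):
    """Return log(N / df) for `term`, or 0.0 if no document contains it.

    Assumption: whitespace tokenization and case folding only.
    """
    term = term.lower()
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    if doc_freq == 0:
        return 0.0
    return math.log(n_docs / doc_freq)

if __name__ == "__main__":
    docs = ["health care", "atomic number", "public health", "chemistry"]
    print(inverse_document_frequency(docs, "health"))
    print(inverse_document_frequency(docs, "atomic"))
```

Rare terms get a higher score than terms that appear in most documents.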

Run TF queries on a corpus

input: indexed corpus
output: tf query data
tf_query.pl -c chemical -b produced -q health
tf_query.pl -c chemical -b produced --all
tf_query.pl -c chemical -b produced --stemmed --all
tf_query.pl -c chemical -b produced -q "atomic number"
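
The last command above queries a two-word phrase. As a rough illustration of what a term frequency for a word or phrase means, the Python sketch below counts how often a query occurs as a contiguous token sequence in a text. It is a simplified stand-in, not the tf_query.pl implementation; the function name is hypothetical, and stemming (the `--stemmed` option) is not modeled.

```python
def term_frequency(text, query):
    """Count occurrences of `query` (one or more words) in `text`.

    Assumption: whitespace tokenization and case folding only.
    """
    tokens = text.lower().split()
    wanted = query.lower().split()
    width = len(wanted)
    if width == 0:
        return 0
    return sum(
        1
        for start in range(len(tokens) - width + 1)
        if tokens[start:start + width] == wanted
    )

if __name__ == "__main__":
    text = "the atomic number is the atomic mass number"
    print(term_frequency(text, "atomic"))
    print(term_frequency(text, "atomic number"))
```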

Generating Networks

Generate a network from a corpus

input: indexed corpus
output: network graph
corpus_to_network.pl -c chemical -b produced -o chemical.graph

Generate a synthetic network using the Erdős-Rényi linking model

output: synthetic graph
# With n nodes and m edges
generate_random_network.pl -o synthetic.graph -t erdos-renyi-gnm -n 100 -m 88
# With n nodes and random edge with probability p
generate_random_network.pl -o synthetic.graph \
-t erdos-renyi-gnp -n 100 -p .1
# Based on another graph
generate_random_network.pl -o synthetic.graph -i $CLAIRLIB/corpora/david_copperfield/adjnoun.graph \
-t erdos-renyi-gnp -p .1
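
The two model types above differ in what is fixed: G(n, m) fixes the number of edges, while G(n, p) includes each possible edge independently with probability p. The Python sketch below shows both, assuming an undirected graph without self-loops or duplicate edges. It is an illustration of the models, not of generate_random_network.pl itself, and the function names are hypothetical.

```python
import itertools
import random

def erdos_renyi_gnm(n, m, seed=None):
    """Pick exactly m distinct edges uniformly from all node pairs."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(range(n), 2))
    return rng.sample(all_pairs, m)

def erdos_renyi_gnp(n, p, seed=None):
    """Keep each possible edge independently with probability p."""
    rng = random.Random(seed)
    return [
        pair
        for pair in itertools.combinations(range(n), 2)
        if rng.random() < p
    ]

if __name__ == "__main__":
    print(len(erdos_renyi_gnm(100, 88, seed=1)))   # always 88 edges
    print(len(erdos_renyi_gnp(100, 0.1, seed=1)))  # varies around 495
```

With n = 100 there are 4950 possible edges, so p = .1 gives about 495 edges on average, whereas the G(n, m) call always returns exactly 88.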

Gathering Network Statistics

Generate plots and statistics from a corpus

input: indexed corpus
output: plots and stats
corpus_to_cos.pl -c chemical -o chemical.cos -b produced
cos_to_cosplots.pl -i chemical.cos
cos_to_histograms.pl -i chemical.cos
cos_to_stats.pl -i chemical.cos -o chemical.stats
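
The `.cos` file holds pairwise document similarities. As a reminder of what a cosine similarity between two documents is, here is a minimal Python sketch using raw term counts as the vector components. Whether corpus_to_cos.pl uses raw counts, tf-idf weights, or something else is not stated here, so the weighting is an assumption, as is the function name.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    print(cosine_similarity("sodium chloride", "sodium hydroxide"))
    print(cosine_similarity("sodium chloride", "noble gas"))
```

Documents with no terms in common score 0, and identical documents score 1.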

Generate plots from a network

input: network file
output: degree distribution plots
network_to_plots.pl -i chemical.cos --bins 100
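
A degree distribution plot shows how many nodes have each degree. The Python sketch below computes that distribution from a list of undirected edges. It is a conceptual illustration only; the function name and the edge-list input are assumptions, and binning (the `--bins` option) is not modeled.

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree to the number of nodes having that degree."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return Counter(degree.values())

if __name__ == "__main__":
    star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
    for deg, count in sorted(degree_distribution(star).items()):
        print(deg, count)
```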

Other Useful Tools

Selecting a subset of a corpus for processing

input: existing corpus
output: directory containing subset of corpus
corpus_to_cluster.pl -c bulgaria-10 -b produced \
-f '^https://www.cia.gov/' \
-f '^http://en.wikipedia.org/' -o filtered
directory_to_corpus.pl -c bulgaria-filtered -b produced -d filtered
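
The `-f` patterns above appear to be regular expressions anchored at the start of each document's URL. Under that assumption, the Python sketch below shows the selection logic: keep a URL if it matches at least one pattern. The function name and the sample URLs are hypothetical, and the real matching rules of corpus_to_cluster.pl may differ.

```python
import re

def filter_urls(urls, patterns):
    """Return the URLs that match at least one regular expression."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [
        url
        for url in urls
        if any(regex.search(url) for regex in compiled)
    ]

if __name__ == "__main__":
    urls = [
        "https://www.cia.gov/example",          # hypothetical sample URLs
        "http://en.wikipedia.org/wiki/Bulgaria",
        "http://www.example.com/bulgaria",
    ]
    patterns = [r"^https://www.cia.gov/", r"^http://en.wikipedia.org/"]
    for url in filter_urls(urls, patterns):
        print(url)
```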

Convert a network from one format to another

input: gml file (or pajek file)
output: edgelist file
convert_network.pl -v \
--input $CLAIRLIB/corpora/david_copperfield/adjnoun.gml \
--input-format gml --output ./adjnoun.graph \
--output-format edgelist
print_network_stats.pl -i ./adjnoun.graph --undirected

Extract n-grams from a document and create a network

input: document
output: stats
extract_ngrams.pl -r "$CLAIRLIB/corpora/1984/1984.txt" \
-f text -w 1984.2gram -N 2 -sort -v
print_network_stats.pl -i 1984.2gram -v --all --sample 100 \
--sample-type forestfire > 1984.2gram.stats
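
To show how n-grams can be read as a network, the Python sketch below extracts bigrams from a text and counts them, so that each distinct bigram acts as a weighted edge from its first word to its second. It is a simplified illustration, not the extract_ngrams.pl implementation or its output format; the function name and the whitespace tokenization are assumptions.

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count contiguous n-grams; for n=2 each key is a (from, to) edge."""
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[start:start + n])
        for start in range(len(tokens) - n + 1)
    )

if __name__ == "__main__":
    counts = ngram_counts("war is peace war is", n=2)
    for (first, second), weight in counts.most_common():
        print(first, second, weight)
```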

Generate statistics for word growth model from a corpus

input: indexed corpus
output: stats
required: Matlab
network_growth.pl -c chemical -b produced
stats2matlab.pl -i chemical.wordmodel.stats -o wordmodel.m
matlab -nojvm -nosplash < wordmodel.m