Modeling Statistical Properties of Written Text
In this tutorial we use Clairlib and other tools to replicate the work done by M. Ángeles Serrano, Alessandro Flammini, and Filippo Menczer in their paper, Modeling Statistical Properties of Written Text. The paper studies several statistical properties of written text: Zipf's law, burstiness, Heaps' law (which describes the sublinear growth of vocabulary size with the length of a document), and the topicality of document collections, which encodes correlations within and across documents that are absent in random null models. The authors also propose a generative model that explains the simultaneous emergence of all these patterns from simple rules.
In this tutorial we will focus on the following properties:
- Zipf's Law
- Heaps' Law
- Similarity Distribution
We study these properties for both the original datasets and for the synthetic collection generated by the model.
Dataset
The authors of the paper selected three diverse public datasets to illustrate their work. These datasets are:
- The Industrial Sector dataset (IS): a collection of corporate Web pages organized into categories or sectors.
- A sample of the Open Directory dataset (ODP): a collection of 150,000 Web pages classified into a large hierarchical taxonomy by volunteer editors.
- Wikipedia (Wiki): a random sample of 100,000 topic pages from the English Wikipedia.
These datasets are available for download here. For the purpose of this tutorial, we will show the code and the results for the IS dataset only.
Create a Clairlib cluster out of the dataset documents
Download and extract the dataset
First, download the dataset and extract it to some directory.
wget http://www.cs.umass.edu/~mccallum/data/sector.tar.gz
tar -xvf sector.tar.gz
This will extract the collection files to a directory named "sector".
Prepare a list of the dataset files
To list all the files in the dataset, use the UNIX command "find" and pipe its output to Perl's "open" function.
open FILES,"find sector/ -type f -print|";
This executes the "find" command and attaches its output stream (one filename per line) to the file handle FILES.
Next, read the filenames through the handle FILES into an array
my @files = <FILES>;
Each element of this array is a filename terminated by the newline character '\n', which must be removed
chomp(@files);
Create the cluster
First, instantiate an empty Clair::Cluster object
use Clair::Cluster;
$cluster = new Clair::Cluster();
Then, use "load_file_list_array" function of Clair::Cluster to load the files listed in "@files" into the cluster
$cluster->load_file_list_array(\@files, type => "html");
Process the cluster files to stem the words and strip the HTML tags
Start by stripping the HTML from the files. To do so, use the "strip_all_documents" function of Clair::Cluster
$cluster->strip_all_documents();
Then, stem the words in all the documents by calling the "stem_all_documents" function of Clair::Cluster
$cluster->stem_all_documents();
Basic statistics
In this section, we calculate the basic statistics of the IS collection. These statistics include:
- Number of documents in the collection.
- Number of non-empty documents in the collection.
- Number of words in the collection.
- Number of unique words (vocabulary) in the collection.
- Average length of documents in number of words.
- Variance in the length of documents in number of words.
- Average length of documents in number of unique words.
Count the number of files in the collection
To count the number of files in the cluster, simply call the "count_elements" function of Clair::Cluster.
my $docs_count = $cluster->count_elements();
To count the number of non-empty files, we find the length of each document in words and count those containing at least one word. The "documents" function of Clair::Cluster returns a reference to a hash of all the documents in the cluster, keyed by document ID, with Clair::Document objects as values.
%documents = %{ $cluster->documents() };
Then, loop through the hash values, find the length of each document by calling the "split_into_words" function, and count the words in the resulting array
my $num_of_nonempty_docs = 0;
foreach my $doc (values %documents) {
    my @words = $doc->split_into_words(type => 'stem');
    if (scalar(@words) > 0) {
        $num_of_nonempty_docs++;
    }
}
This will give the following results:
- Number of documents in the collection:
- Clairlib: 9571
- Paper: 9571
- Number of non-empty documents:
- Clairlib: 9567
- Paper: 9556
The differences between the results here (and in the rest of this tutorial) and those in the paper come from differences in the stemming and stripping methods.
Count the number of unique words (vocabulary) in the collection
To count the number of unique words in the collection, use the "tf" function of Clair::Cluster, which returns a hash of the frequencies of all the terms (unique words) in the collection, then count the keys of this hash
my %tf = $cluster->tf();
my $num_unique_words = scalar(keys %tf);
This will give the following result:
- Clairlib: 51604
- Paper: 47979
Calculate the average length of documents in number of words
This can be done by summing the lengths of all the documents and then dividing the result by the number of documents.
%documents = %{ $cluster->documents() };
my $num_of_words = 0;
foreach my $doc (values %documents) {
    my @words = $doc->split_into_words(type => 'stem');
    $num_of_words += scalar(@words);
}
my $avg_doc_length = $num_of_words / $docs_count;
This will give the following result:
- Clairlib: 316.349
- Paper: 313.46
Calculate the variance in the length of documents in number of words
Create an array of the document lengths, then pass it to a subroutine that calculates the variance of its values.
my %documents = %{ $cluster->documents() };
my @docs_lengths = ();
foreach my $doc (values %documents) {
    my @words = $doc->split_into_words(type => 'stem');
    push @docs_lengths, scalar(@words);
}
my $docs_length_variance = variance(\@docs_lengths);

# Unbiased sample variance: (n*sum_sq - sum^2) / (n*(n-1))
sub variance {
    my $array_ref = shift;
    my $n = scalar @{$array_ref};
    my $result = 0;
    my $sum = 0;
    my $sum_sq = 0;
    foreach my $item (@{$array_ref}) {
        $sum += $item;
        $sum_sq += $item * $item;
    }
    if ($n > 1) {
        $result = (($n * $sum_sq) - $sum**2) / ($n * ($n - 1));
    }
    return $result;
}
This will give the following result:
- Clairlib: 552975
- Paper: 566409
Calculate the average length of the documents in number of unique words
First, use the "get_unique_words" function of Clair::Document to get an array of the unique words of each document. Sum the sizes of the returned arrays, then divide the result by the number of documents.
my %documents = %{ $cluster->documents() };
my $sum_doc_lengths_unique = 0;
foreach my $doc (values %documents) {
    my @unique_words = $doc->get_unique_words();
    $sum_doc_lengths_unique += scalar(@unique_words);
}
my $avg_length_unique = $sum_doc_lengths_unique / $docs_count;
This will give the following result:
- Clairlib: 136.78
- Paper: 124.26
Distribution of document length
According to the paper, the document length distribution can be approximately fitted by a lognormal distribution. Matlab can be used to do the fitting. First, output the document lengths to a file in a Matlab-compatible format.
open LENGTHS, ">docs_lengths" or die "can't open file: $!";
foreach my $length (@docs_lengths) {
    print LENGTHS $length, "\n";
}
close LENGTHS;
"@docs_lengths" is the array we have just created in the previous section.
Next, start Matlab and run the following commands
load docs_lengths;
PD = fitdist(docs_lengths, 'lognormal')
This will give the following results:
- Tutorial: mu 4.98, sigma 1.16
- Paper: mu 4.81, sigma (ML) 2.10
Note that the paper reports the maximum-likelihood estimate of sigma, while Matlab's fitdist estimates sigma as the square root of the unbiased estimate of the variance of the log of the data.
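The difference between the two estimators can be checked with a few lines of plain Perl. The document lengths below are invented for illustration; only the formulas matter:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Compare the two sigma estimates on the log of some made-up lengths:
# the ML estimate divides the sum of squares by n, while the unbiased
# variant (what fitdist reports) divides by n - 1.
my @lengths = (120, 300, 45, 800, 210, 95, 400);
my @logs    = map { log($_) } @lengths;
my $n       = scalar @logs;
my $mu      = sum(@logs) / $n;
my $ss      = sum(map { ($_ - $mu) ** 2 } @logs);
my $sigma_ml       = sqrt($ss / $n);        # maximum likelihood
my $sigma_unbiased = sqrt($ss / ($n - 1));  # unbiased-variance version
printf "mu=%.3f sigma_ML=%.3f sigma=%.3f\n", $mu, $sigma_ml, $sigma_unbiased;
```

The unbiased estimate is always slightly larger than the ML estimate, with the gap shrinking as the number of documents grows.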
Zipf's Law
Zipf's law states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. What we need to do here is fit a Zipf distribution to the word frequencies and estimate the value of the exponent alpha. We will use the Matlab code released by Clauset, Shalizi, and Newman with their power-law paper. Their code is available at http://www.santafe.edu/~aaronc/powerlaws/.
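The rank-frequency table behind Zipf's law can be sketched in a few lines of stand-alone Perl. The toy text below is made up purely for illustration; in this tutorial the real frequencies come from the cluster's "tf" function instead:

```perl
use strict;
use warnings;

# Build a term-frequency hash from a toy text, then rank the terms by
# descending frequency: rank 1 is the most frequent term. Under Zipf's
# law, frequency falls off roughly as 1 / rank^alpha.
my $text = "the cat sat on the mat and the cat ran";
my %tf;
$tf{$_}++ for split ' ', $text;

my @ranked = sort { $tf{$b} <=> $tf{$a} } keys %tf;
for my $r (1 .. scalar @ranked) {
    print "$r $ranked[$r - 1] $tf{$ranked[$r - 1]}\n";
}
```

The file written in the next step is just the values of such a hash, one frequency per line.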
In a previous section, we built a hash of terms and their frequencies. Write these frequencies to a file in a Matlab-compatible format.
open FREQ, ">frequencies" or die "can't open file: $!";
foreach my $freq (values %tf) {
    print FREQ $freq, "\n";
}
close FREQ;
Then, use the Matlab code we've just mentioned to estimate the value of alpha.
load frequencies;
[alpha, xmin, L] = plfit(frequencies);
This will give the following result for alpha:
- Tutorial: 1.59
- Paper: 1.78
To plot the distribution on a log-log graph, use this Matlab command
loglog(frequencies,'.','color','green');
Heaps' Law
Heaps' law describes the sublinear growth of vocabulary size with the length of the document. We'll use Matlab to plot the distribution. To do this, write out, for each document in the cluster, its length in words and its number of unique words (vocabulary size).
open HEAP, ">heap" or die "Can't open file $!";
%documents = %{ $cluster->documents() };
foreach my $doc (values %documents) {
    my @words = $doc->split_into_words(type => 'stem');
    my @unique_words = $doc->get_unique_words();
    print HEAP scalar(@words), " ", scalar(@unique_words), "\n";
}
close HEAP;
To plot the distribution, use this Matlab code
load heap;
s_size = 1000;
hh = sort(heap);
s = zeros(s_size, 2);
st = floor(size(heap, 1) / s_size);
k = 1;
for i = 1:s_size
    s(i, 1) = hh(k, 1);
    s(i, 2) = hh(k, 2);
    k = k + st;
end
x = heap(:, 1);
y = heap(:, 2);
loglog(x, y, '.', 'color', 'blue');
ylabel('w(n)', 'FontSize', 16);
xlabel('n', 'FontSize', 16);
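The sublinear growth that Heaps' law describes can also be seen by scanning a token stream and recording the vocabulary size after each token. The tiny stream below is fabricated for illustration:

```perl
use strict;
use warnings;

# Track w(n), the number of distinct words seen after n tokens: it can
# never grow faster than n itself and typically grows much more slowly,
# since frequent words keep repeating.
my @stream = qw(a b a c a b d a c e a b);
my %seen;
my @growth;
for my $tok (@stream) {
    $seen{$tok} = 1;
    push @growth, scalar keys %seen;   # vocabulary size so far
}
print join(' ', @growth), "\n";
```

Plotting such a growth curve on log-log axes, as the Matlab code above does for the real collection, makes the sublinear exponent visible as a slope below 1.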
Similarity Distribution
To calculate the similarity between all pairs of documents in the cluster, use the "compute_cosine_matrix" function of Clair::Cluster. Since the number of documents is large, computing the similarity for all pairs would take a very long time. Therefore, you can optionally pass a "sample_size" parameter, in which case the similarity is computed only for a random sample of "sample_size" pairs.
my %cos_matrix = $cluster->compute_cosine_matrix(type=>"stem", sample_size=>5000);
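The cosine measure itself is just the dot product of two term-frequency vectors divided by the product of their norms. A minimal stand-alone sketch follows; the vectors are hypothetical, and unlike Clairlib's implementation it applies no idf or other term weighting:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Cosine similarity between two term-frequency hashes:
# cos(u, v) = dot(u, v) / (|u| * |v|).
sub cosine {
    my ($u, $v) = @_;
    my $dot = sum(0, map { ($u->{$_} // 0) * ($v->{$_} // 0) } keys %$u);
    my $nu  = sqrt(sum(0, map { $_ ** 2 } values %$u));
    my $nv  = sqrt(sum(0, map { $_ ** 2 } values %$v));
    return ($nu && $nv) ? $dot / ($nu * $nv) : 0;
}

# Toy documents sharing the term "cat".
my %d1 = (cat => 2, sat => 1);
my %d2 = (cat => 1, ran => 1);
printf "%.4f\n", cosine(\%d1, \%d2);
```

The result is always in [0, 1] for term-frequency vectors: 0 for documents sharing no terms, 1 for identical direction.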
To plot the similarity distribution, write the cosine values to a file in a Matlab-compatible format.
open COS, ">cos" or die "Can't open file $!";
foreach my $doc1 (keys %cos_matrix) {
    foreach my $doc2 (keys %{ $cos_matrix{$doc1} }) {
        print COS $cos_matrix{$doc1}{$doc2}, "\n";
    }
}
close COS;
Then, use the following Matlab code to plot the similarity distribution on a semilog graph.
load cos;
[n, xout] = hist(cos, 100);
xsim = [1:100] / 100;
ps = n / size(cos, 1);
semilogy(xsim, ps, '.', 'color', 'green');
axis([0.0 1.0 10^-6 10^0]);
ylabel('p(s)', 'FontSize', 16);
xlabel('s', 'FontSize', 16);
Generative Model
The authors of the paper proposed a model that generates synthetic documents by drawing words from the original dataset's vocabulary. Document lengths in number of words are drawn from a lognormal distribution. Here we only show the code that implements the second version of the model, which has a memory mechanism to capture topicality. For more details about the model and its algorithm, refer to the paper. The Clair::RandomDistribution::* modules are used to draw random values from the Zipfian, lognormal, and exponential distributions.
use Clair::RandomDistribution::LogNormal;
use Clair::RandomDistribution::Zipfian;
use Clair::RandomDistribution::Exponential;

$rnd_length = new Clair::RandomDistribution::LogNormal(mean => 4.81, std_dev => 1.45, dist_size => $num_unique_words);
$rnd_term   = new Clair::RandomDistribution::Zipfian(alpha => 1, dist_size => $num_unique_words);
$rnd_z      = new Clair::RandomDistribution::Exponential(lambda => 0, dist_size => $num_unique_words);

# Record the initial rank r0 of every vocabulary term.
%r0 = ();
$rank = 1;
foreach $k (sort sortvocab (keys %tf)) {
    $r0{$k} = $rank;
    $rank++;
}

`mkdir synth`;
$r_star = 1;
for ($i = 0; $i < $size; $i++) {
    open(FILE, ">synth/SI_synth_$i") or die "can't open file";
    print "r* = $r_star";

    # Reset the within-document counts for terms at rank r* or below.
    %counts = ();
    $current_r = 1;
    foreach $ke (sort sortvocab (keys %tf)) {
        if ($current_r >= $r_star) {
            $counts{$ke} = 0;
        }
        $current_r++;
    }

    $len = $rnd_length->draw_rand_from_dist();
    %doc = ();
    @terms = ();
    foreach $term (sort sortfreq (keys %counts)) {
        push @terms, $term;
    }

    for (my $j = 0; $j < $len; $j++) {
        $random_term = $rnd_term->draw_rand_from_dist();
        $newterm = $terms[$random_term];
        $counts{$newterm}++;
        $doc{$newterm} = $counts{$newterm};
        # Bubble the drawn term up the ranking while its count
        # exceeds that of the term ranked above it.
        $cmp = $random_term - 1;
        while ($cmp >= 0 && $counts{$newterm} > $counts{$terms[$cmp]}) {
            $cmp--;
        }
        ($terms[$random_term], $terms[$cmp + 1]) = ($terms[$cmp + 1], $terms[$random_term]);
    }

    foreach $term (keys %doc) {
        print FILE $term, " ";
    }
    close(FILE);
    print "Finished Synth Doc $i\n";
    $r_star = 1; # $rnd_z->draw_rand_from_dist();
}

# Sort by within-document count, breaking ties by initial rank.
sub sortfreq2 {
    if ($counts{$a} == $counts{$b}) {
        return $r0{$a} <=> $r0{$b};
    }
    return $counts{$b} <=> $counts{$a};
}

sub sortfreq {
    $counts{$b} <=> $counts{$a};
}

sub sortvocab {
    $tf{$b} <=> $tf{$a};
}
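The core of the memory mechanism can be isolated from the Clair modules in a small self-contained sketch. The vocabulary and document length below are toy values, and the Zipfian rank draw is implemented inline rather than via Clair::RandomDistribution::Zipfian:

```perl
use strict;
use warnings;

# Memory mechanism in miniature: draw a rank from a Zipfian
# distribution, emit the term currently holding that rank, and bubble
# the term up the ranking as its within-document count grows, so
# already-used terms become ever more likely to be reused.
my @terms  = qw(alpha beta gamma delta epsilon);
my %counts = map { $_ => 0 } @terms;
my $len    = 50;    # toy document length

# Zipfian rank weights: P(rank r) proportional to 1 / r.
my @w = map { 1 / $_ } 1 .. scalar @terms;
my $norm = 0;
$norm += $_ for @w;

sub draw_rank {
    my $u = rand() * $norm;
    for my $r (0 .. $#w) {
        return $r if ($u -= $w[$r]) <= 0;
    }
    return $#w;
}

for (1 .. $len) {
    my $r = draw_rank();
    $counts{ $terms[$r] }++;
    # Restore the ordering invariant: higher counts sit at lower ranks.
    while ($r > 0 && $counts{ $terms[$r] } > $counts{ $terms[$r - 1] }) {
        @terms[ $r - 1, $r ] = @terms[ $r, $r - 1 ];
        $r--;
    }
}
print join(' ', map { "$_:$counts{$_}" } @terms), "\n";
```

Because terms are reordered after every draw, the rank list acts as the document's memory: whichever terms happen to be drawn early dominate the rest of the document, producing the topical coherence the model aims to capture.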