Modeling Statistical Properties of Written Text


In this tutorial we use Clairlib and other tools to replicate the work done by M. Ángeles Serrano, Alessandro Flammini, and Filippo Menczer in their paper, Modeling Statistical Properties of Written Text. The paper studies several statistical properties of written text, including Zipf's law, burstiness, Heaps' law (the sublinear growth of vocabulary size with the length of a document), and the topicality of document collections, which encodes correlations within and across documents that are absent in random null models. The authors also propose a generative model that explains the simultaneous emergence of all these patterns from simple rules.

In this tutorial we will focus on the following properties:

  • Zipf's Law
  • Heaps' Law
  • Similarity Distribution

We study these properties for both the original datasets and for the synthetic collection generated by the model.


Dataset

The authors of the paper selected three diverse public datasets to illustrate their work. These datasets are:

  • The Industrial Sector dataset (IS): a collection of corporate Web pages organized into categories or sectors.
  • A sample of the Open Directory Project (ODP): a collection of 150,000 Web pages classified into a large hierarchical taxonomy by volunteer editors.
  • Wikipedia (Wiki): a random sample of 100,000 topic pages from the English Wikipedia.

These datasets are available for download here. For the purpose of this tutorial, we will show the code and the results for the IS dataset only.

Create a Clairlib cluster out of the dataset documents

Download and extract the dataset

First, download the dataset and extract it to some directory.

wget http://www.cs.umass.edu/~mccallum/data/sector.tar.gz
tar -xvf sector.tar.gz

This will extract the collection files to a directory named "sector".

Prepare a list of the dataset files

To list all the files in the dataset, run the UNIX "find" command through Perl's "open" function with a pipe.

open FILES,"find sector/ -type f -print|";

This executes the "find" command and attaches its output (one filename per line) to the filehandle FILES through a pipe.

Next, read the filenames from the filehandle FILES into an array.

my @files = <FILES>;

Each element of this array is a filename ending with the newline character '\n', which must be removed with chomp.

chomp(@files);
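
Alternatively, the same list can be built without shelling out to "find", using Perl's core File::Find module. This is a minimal sketch, assuming the collection was extracted to the "sector" directory as above:

use File::Find;

# Collect every regular file under the extracted "sector" directory.
my @files;
find(sub { push @files, $File::Find::name if -f }, 'sector');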

Create the cluster

First, instantiate an empty Clair::Cluster object

use Clair::Cluster;
my $cluster = new Clair::Cluster();

Then, use "load_file_list_array" function of Clair::Cluster to load the files listed in "@files" into the cluster

$cluster->load_file_list_array(\@files, type => "html");

Process the cluster files to stem the words and strip the HTML tags

Start by stripping the HTML from the files. To do so, use the "strip_all_documents" function of Clair::Cluster.

$cluster->strip_all_documents();

Then, stem the words in all the documents by calling the "stem_all_documents" function of Clair::Cluster

$cluster->stem_all_documents();

Basic statistics

In this section, we calculate the basic statistics of the IS collection. These statistics include:

  • Number of documents in the collection.
  • Number of non-empty documents in the collection.
  • Number of words in the collection.
  • Number of unique words (vocabulary) in the collection.
  • Average length of documents in number of words.
  • Variance in the length of documents in number of words.
  • Average length of documents in number of unique words.

Count the number of files in the collection

To count the number of files in the cluster, simply call the "count_elements" function of Clair::Cluster.

my $docs_count = $cluster->count_elements();

To count the number of non-empty files, we count the words in each document and keep track of how many documents contain at least one word. The "documents" function of Clair::Cluster returns a reference to a hash of all the documents in the cluster; the keys are document IDs and the values are Clair::Document objects.

%documents = %{ $cluster->documents() };

Then, loop through the hash values, split each document into words with the "split_into_words" function, and count the documents whose word list is non-empty.

my $num_of_nonempty_docs = 0;
foreach my $doc (values %documents){
   if(scalar($doc->split_into_words(type => 'stem')) > 0){
       $num_of_nonempty_docs++;
   }
}
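
Print both counts so they can be compared against the paper:

print "Documents: $docs_count\n";
print "Non-empty documents: $num_of_nonempty_docs\n";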

This will give the following results:

  • Number of documents in the collection:
    • Clairlib: 9571
    • Paper: 9571
  • Number of non-empty documents:
    • Clairlib: 9567
    • Paper: 9556

The differences between these results and the paper's, here and in the rest of this tutorial, come from differences in the stemming and stripping methods.

Count the number of unique words (vocabulary) in the collection

To count the number of unique words in the collection, use the "tf" function of Clair::Cluster, which returns a hash of the frequencies of all the terms (unique words) in the collection, and then count the number of keys in this hash.

my %tf = $cluster->tf();
my $num_unique_words = scalar (keys %tf);

This will give the following result:

  • Clairlib: 51604
  • Paper: 47979
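
As a quick sanity check on the %tf hash, you can also print the ten most frequent stems (an illustrative snippet; the exact terms depend on the stripping and stemming steps above):

my @top_terms = (sort { $tf{$b} <=> $tf{$a} } keys %tf)[0 .. 9];
foreach my $term (@top_terms) {
    print "$term $tf{$term}\n";
}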

Calculate the average length of documents in number of words

This can be done by summing the lengths of all the documents and dividing the result by the number of documents.

%documents = %{ $cluster->documents() };
my $num_of_words = 0;
foreach my $doc (values %documents){
       $num_of_words  += scalar($doc->split_into_words(type => 'stem'));
}
my $avg_doc_length = $num_of_words / $docs_count;

This will give the following result:

  • Clairlib: 316.349
  • Paper: 313.46

Calculate the variance in the length of documents in number of words

Create an array of the document lengths, then pass it to a subroutine that calculates the sample variance of its values.

my %documents = %{ $cluster->documents() };
my @docs_lengths =();
foreach my $doc (values %documents){
       push @docs_lengths, scalar($doc->split_into_words(type => 'stem'));
}
my $docs_length_variance = variance(\@docs_lengths);

sub variance
{
   my $array_ref = shift;
   my $n = scalar @{$array_ref};
   my $sum = 0;
   my $sum_sq = 0;
   foreach my $item (@{$array_ref})
   {
       $sum += $item;
       $sum_sq += $item*$item;
   }
   my $result = 0;
   if ($n > 1)
   {
       # Sample variance: (n * sum of squares - square of the sum) / (n * (n - 1))
       $result = (($n * $sum_sq) - $sum**2)/($n*($n-1));
   }
   return $result;
}


This will give the following result:

  • Clairlib: 552975
  • Paper: 566409
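
To sanity-check the variance subroutine itself, run it on a small array whose sample variance is easy to work out by hand; for (1, 2, 3, 4) the mean is 2.5 and the sample variance is 5/3:

my @test = (1, 2, 3, 4);
print variance(\@test), "\n";    # prints roughly 1.6667 (= 5/3)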

Calculate the average length of the documents in number of unique words

First, use the "get_unique_words" function of Clair::Document to get an array of the unique words of each of the documents. Sum the sizes of the returned arrays then divide the result by the number of the documents.

my %documents = %{ $cluster->documents() };
my $sum_doc_lengths_unique = 0;
foreach my $doc (values %documents){
       $sum_doc_lengths_unique += scalar($doc->get_unique_words());
}
my $avg_length_unique = $sum_doc_lengths_unique / $docs_count;

This will give the following result:

  • Clairlib: 136.78
  • Paper: 124.26

Distribution of document length

According to the paper, the document length distribution can be approximately fitted by a lognormal distribution. Matlab can be used to do this fitting. First, write the document lengths to a file in a Matlab-compatible format (one value per line).

open LENGTHS, ">docs_lengths" 
        or die "can't open file: $!";
foreach my $length (@docs_lengths){
    print LENGTHS $length,"\n";
}
close LENGTHS;

"@docs_lengths" is the array we have just created in the previous section.

Next, start Matlab and run the following commands

load docs_lengths;
PD = fitdist(docs_lengths, 'lognormal')

This will give the following results:

  • Tutorial: mu 4.98, sigma 1.16
  • Paper: mu 4.81, sigma (ML) 2.10

Note that the paper shows the maximum likelihood estimate for sigma, while in this tutorial, Matlab estimates the value of sigma as the square root of the unbiased estimate of the variance of the log of the data.
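
The maximum-likelihood estimate that the paper reports can be computed directly from the log of the document lengths: mu is the mean of the log lengths, and sigma is the square root of the mean squared deviation from mu (dividing by n rather than n-1). A short Perl sketch, reusing @docs_lengths (empty documents are skipped because log(0) is undefined):

my @logs = map { log($_) } grep { $_ > 0 } @docs_lengths;
my $n = scalar @logs;
my $mu = 0;
$mu += $_ for @logs;
$mu /= $n;
my $sigma_ml = 0;
$sigma_ml += ($_ - $mu) ** 2 for @logs;
$sigma_ml = sqrt($sigma_ml / $n);    # the ML estimate divides by n, not n - 1
print "mu = $mu, sigma (ML) = $sigma_ml\n";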

Zipf's Law

Zipf's law states that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. What we need to do here is fit a Zipf (power-law) distribution to the word frequencies and estimate the value of the exponent alpha. We will use the Matlab code provided by Newman et al. in their paper; it is available at http://www.santafe.edu/~aaronc/powerlaws/.

In the previous section we built a hash, %tf, of terms and their frequencies. Write these frequencies to a file in a Matlab-compatible format.

open FREQ, ">frequencies" or die "can't open file: $!";
foreach my $freq (values %tf){
    print FREQ $freq,"\n";
}
close FREQ;

Then, use the Matlab code we've just mentioned to estimate the value of alpha.

load frequencies
[alpha, xmin, L] = plfit(frequencies);

This will give the following result for alpha:

  • Tutorial: 1.59
  • Paper: 1.78

To plot the rank-frequency distribution on a log-log graph, sort the frequencies in descending order and plot them with this Matlab command

loglog(sort(frequencies,'descend'),'.','color','green');

Heaps' Law

Heaps' law describes the sublinear growth of vocabulary size with the length of a document. We'll use Matlab to plot the distribution. To do this, write out, for each document in the cluster, its length in number of words and its number of unique words (vocabulary).

open HEAP, ">heap" or die "Can't open file $!";
%documents = %{ $cluster->documents() };
foreach my $doc (values %documents){
   my $length_words = scalar($doc->split_into_words(type => 'stem'));
   my $length_unique_words = scalar($doc->get_unique_words());
   print HEAP "$length_words $length_unique_words\n";
}
close HEAP;

To plot the distribution, use this Matlab code

load heap;
% s holds an evenly spaced subsample of the sorted data (s_size points);
% the scatter plot below uses the full data set.
s_size=1000;
hh=sort(heap);
s=zeros(s_size,2);
st=floor(size(heap,1)/s_size);
k=1;
for i=1:s_size
   s(i,1)=hh(k,1);
   s(i,2)=hh(k,2);
   k=k+st;
end
x=heap(:,1);
y=heap(:,2);
loglog(x,y,'.','color','blue');
ylabel('w(n)','FontSize',16);
xlabel('n','FontSize',16)
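
Heaps' law predicts w(n) ~ K * n^beta with beta < 1, so beta can be estimated with a least-squares fit of log w(n) against log n. Below is a minimal Perl sketch over the same per-document pairs that were written to the "heap" file (documents with no words are skipped):

my ($n_pts, $sx, $sy, $sxx, $sxy) = (0, 0, 0, 0, 0);
foreach my $doc (values %documents) {
    my $n_words = scalar($doc->split_into_words(type => 'stem'));
    my $n_vocab = scalar($doc->get_unique_words());
    next unless $n_words > 0 && $n_vocab > 0;
    my ($x, $y) = (log($n_words), log($n_vocab));
    $n_pts++;
    $sx  += $x;
    $sy  += $y;
    $sxx += $x * $x;
    $sxy += $x * $y;
}
# Slope of the least-squares line through the (log n, log w(n)) pairs.
my $beta = ($n_pts * $sxy - $sx * $sy) / ($n_pts * $sxx - $sx ** 2);
print "Estimated Heaps exponent beta = $beta\n";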

Similarity Distribution

To calculate the similarity between all pairs of documents in the cluster, use the "compute_cosine_matrix" function of Clair::Cluster. Since the number of documents is large, computing the similarity for all pairs would take a very long time. Therefore, you can optionally pass a "sample_size" parameter, in which case the similarity is computed only for a random sample of "sample_size" pairs.

my %cos_matrix = $cluster->compute_cosine_matrix(type=>"stem", sample_size=>5000);
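
For reference, the cosine similarity reported for a pair of documents is the normalized dot product of their term-frequency vectors. The following is a minimal sketch of that computation for two Clair::Document objects (whether Clairlib applies any additional weighting, such as IDF, inside compute_cosine_matrix is not shown here; this sketch uses raw stem counts only):

# Cosine similarity between two Clair::Document objects, using raw stem counts.
sub cosine_sim {
    my ($doc_a, $doc_b) = @_;
    my (%v1, %v2);
    $v1{$_}++ for $doc_a->split_into_words(type => 'stem');
    $v2{$_}++ for $doc_b->split_into_words(type => 'stem');
    my ($dot, $n1, $n2) = (0, 0, 0);
    $dot += $v1{$_} * ($v2{$_} || 0) for keys %v1;
    $n1  += $_ ** 2 for values %v1;
    $n2  += $_ ** 2 for values %v2;
    return 0 unless $n1 && $n2;
    return $dot / (sqrt($n1) * sqrt($n2));
}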

To plot the similarity distribution, write the cosine values to a file in a Matlab-compatible format.

open COS, ">cos" or die "Can't open file $!";
foreach my $doc1 (keys %cos_matrix) {
     foreach my $doc2 (keys %{ $cos_matrix{$doc1} }) {
            print COS $cos_matrix{$doc1}{$doc2}, "\n";
     }
}
close COS;

Then, use the following Matlab code to plot the similarity distribution on a semilog graph.

load cos;
[n,xout]=hist(cos,100);
xsim = [1:100]/100;
ps=n/size(cos,1);
semilogy(xsim,ps,'.','color','green')
axis([0.0 1.0 10^-6 10^0])
ylabel('p(s)','FontSize',16);
xlabel('s','FontSize',16)

Generative Model

The paper's authors propose a model that generates synthetic documents by drawing words from the original dataset vocabulary. The document lengths, in number of words, are drawn from a lognormal distribution. Here we only show the code that implements the second version of the model, which has a memory mechanism to capture topicality. For more details about the model and its algorithm, refer to the paper. The Clair::RandomDistribution::* modules are used to draw random values from the Zipfian, lognormal, and exponential distributions.

use Clair::RandomDistribution::LogNormal;
use Clair::RandomDistribution::Zipfian;
use Clair::RandomDistribution::Exponential;
  
$rnd_length = new Clair::RandomDistribution::LogNormal(mean=>4.81, std_dev => 1.45, dist_size=>$num_unique_words);
$rnd_term = new Clair::RandomDistribution::Zipfian(alpha=>1, dist_size=>$num_unique_words);
$rnd_z = new Clair::RandomDistribution::Exponential(lambda=>0, dist_size=>$num_unique_words);
  
# Rank the vocabulary terms by their frequency in the original collection.
%r0=();
$rank=1;
foreach $k (sort sortvocab keys %tf)
{
     $r0{$k}=$rank;
     $rank++;
}

$size = $docs_count;    # number of synthetic documents to generate (here, as many as in the original collection)
`mkdir synth`;          # create the output directory once, outside the loop
$r_star=1;
for($i=0;$i<$size;$i++)
{
   open(FILE,">synth/SI_synth_$i") or die "can't open file";
   print "r* = $r_star\n";
   # Activate the terms whose rank is at least r*.
   my %counts=();
   $current_r=1;
   foreach $ke (sort sortvocab keys %tf)
   {
       if($current_r>=$r_star){
           $counts{$ke}=0;
       }
       $current_r++;
   }
   # Draw the document length from the lognormal distribution.
   $len=$rnd_length->draw_rand_from_dist();
   %doc = ();
   @terms = ();
   foreach $term (sort sortfreq keys %counts)
   {
         push @terms, $term;
   }
   for(my $j=0;$j<$len;$j++)
   {
        # Draw a rank from the Zipfian distribution and pick the term at that position.
        $random_term = $rnd_term->draw_rand_from_dist();
        $newterm = $terms[$random_term];
        $counts{$newterm}++;
        $doc{$newterm}=$counts{$newterm};
        # Memory mechanism: move the term up the list past all terms with smaller counts.
        $pos = $random_term;
        $cmp = $pos - 1;
        while ($cmp>=0 && $counts{$newterm}>$counts{$terms[$cmp]}){
               $cmp--;
        }
        $cmp++;
        ($terms[$pos],$terms[$cmp]) = ($terms[$cmp], $terms[$pos]);
   }
   foreach $term (keys %doc)
   {
     print FILE  $term, " ";
   }
   close(FILE);
   print "Finished Synth Doc $i\n";
   $r_star= 1; #$rnd_z->draw_rand_from_dist();
}

# Like sortfreq, but breaks ties using the original ranks in %r0.
sub sortfreq2{
   if($counts{$a}==$counts{$b}){
       if ($r0{$a}>$r0{$b}){
            return 1;
       }else{
            return -1;
       }
   }else{
         return $counts{$b} <=> $counts{$a};
   }
}

# Sort terms by their current count, most frequent first.
sub sortfreq{
   $counts{$b} <=> $counts{$a};
}
 
# Sort vocabulary terms by their frequency in the original collection, most frequent first.
sub sortvocab{
   $tf{$b} <=> $tf{$a};
}
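
Once the synthetic documents have been written under "synth/", the same measurements from the previous sections can be repeated on the synthetic collection, as stated in the introduction. A sketch, assuming that load_file_list_array also accepts type => "text" for plain-text files (the synthetic documents contain no HTML, so no stripping is needed):

open SYNTH, "find synth/ -type f -print|" or die "can't run find: $!";
my @synth_files = <SYNTH>;
close SYNTH;
chomp(@synth_files);

my $synth_cluster = new Clair::Cluster();
$synth_cluster->load_file_list_array(\@synth_files, type => "text");   # type value assumed
$synth_cluster->stem_all_documents();

my %synth_tf = $synth_cluster->tf();
print "Synthetic vocabulary size: ", scalar(keys %synth_tf), "\n";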