Network Properties of Written Human Language

In this tutorial we use Clairlib to replicate the work of Masucci and Rodgers in their paper, Network Properties of Written Human Language [1], which investigates written human language within the framework of complex network theory. Specifically, they analyze the topology of Orwell's novel 1984, focusing on the local properties of the network.

Network Properties

The properties that we will calculate for the network that represents the text are:

  • Number of nodes: Each node corresponds to a word or a punctuation mark.
  • Number of edges: Two nodes are linked by an edge if they are neighbors in the text.
  • Reciprocity value: Quantifies the non-random presence of mutual edges between pairs of vertices.
  • Mean degree (word frequency): The degree of a word is the number of words it is connected to, and the mean is taken over all words in the network. The degree and the frequency of a word have the same meaning, and are equal, because every time a new word is added to the text it is the only vertex of the network to acquire an edge.
  • Degree distribution: The probability distribution of word degrees over the whole network.
  • Zipf's power law exponent: The exponent of the power law that relates a word's frequency of occurrence to its rank (where words are ranked by their frequency of occurrence).
  • Growth exponent: The exponent of the growth in the number of words with respect to time.
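
As an illustration of the Zipf exponent, the sketch below ranks words by frequency and fits a straight line to log(frequency) against log(rank) by ordinary least squares. The helper names slope and zipf_exponent and the toy token list are assumptions made for this example. They are not part of Clairlib, and the fitting procedure used in the paper may differ.

```perl
use strict;
use warnings;

# Least-squares slope of y against x (two array refs of equal length).
# Assumes at least two distinct x values, otherwise the denominator is zero.
sub slope {
    my ($x, $y) = @_;
    my $n = @$x;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $x->[$i];
        $sy  += $y->[$i];
        $sxx += $x->[$i] ** 2;
        $sxy += $x->[$i] * $y->[$i];
    }
    return ($n * $sxy - $sx * $sy) / ($n * $sxx - $sx ** 2);
}

# Hypothetical helper: estimate the Zipf exponent from a list of tokens.
sub zipf_exponent {
    my @tokens = @_;
    my %freq;
    $freq{$_}++ for @tokens;
    # Sort counts in descending order, so rank 1 is the most frequent word.
    my @counts   = sort { $b <=> $a } values %freq;
    my @log_rank = map { log($_) } 1 .. scalar @counts;
    my @log_freq = map { log($_) } @counts;
    return slope(\@log_rank, \@log_freq);
}

# Toy input with word frequencies 8, 4, 2 and 1.
my @toy = (("a") x 8, ("b") x 4, ("c") x 2, "d");
printf "Zipf exponent: %.2f\n", zipf_exponent(@toy);
```

The fitted slope is negative because frequency falls as rank increases. On a real corpus the fit is usually restricted to a range of ranks, which this sketch does not attempt.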

Corpus

The corpus used in this tutorial is the well-known novel Nineteen Eighty-Four (abbreviated to 1984) by the English author George Orwell. The novel is available in plain-text (txt) format, and the code in this tutorial expects it in a file named 1984.txt.


Convert Text to Network

The first step is to build a network from the novel text. We treat the text as a finite directed network in which the words are the vertices and two vertices are linked if they are neighbors. Punctuation is also considered as vertices.

To convert the text to a network, we take the following steps:

  • Read in the file:
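  my $file = "1984.txt";
  open(my $in, "<", $file) or die "Cannot open $file: $!";
  my $text = do { local $/; <$in> };   # read the whole file into one string
  close($in);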
  • Split the text into separate words and store them in an array:
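  # Drop apostrophes that follow or precede a space (typically quotation marks).
  $text =~ s/ \'/ /g;
  $text =~ s/\' / /g;
  # Pad every non-letter character (punctuation, digits) with spaces so it becomes its own token.
  $text =~ s/([^A-Za-z])/ $1 /g;
  # Splitting on ' ' skips leading whitespace, so the array has no empty first element.
  my @words = split ' ', $text;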
  • Normalize the words' case by converting each to lower case:
 # Lower-case in place so that the following steps see the normalized words.
 @words = map { lc } @words;
  • Build the network and write it to a file:
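 open(my $out, ">", "1984.graph") or die "Cannot write 1984.graph: $!";
 # Each pair of adjacent tokens becomes one directed edge, written as "source target".
 for (my $i = 0; $i < $#words; $i++)
 {
    print $out "$words[$i] $words[$i+1]\n";
 }
 close($out);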
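
To see what the conversion produces, the steps above can be tried on a short in-memory string instead of the whole novel. This is a minimal sketch. The helper name text_to_edges is hypothetical, and the sample sentence stands in for the contents of 1984.txt.

```perl
use strict;
use warnings;

# Hypothetical helper that follows the conversion steps on a string and
# returns the edges as [source, target] pairs instead of writing a file.
sub text_to_edges {
    my ($text) = @_;
    $text =~ s/ \'/ /g;
    $text =~ s/\' / /g;
    $text =~ s/([^A-Za-z])/ $1 /g;                 # isolate punctuation
    my @words = map { lc } split ' ', $text;       # tokenize and lower-case
    return map { [ $words[$_], $words[$_ + 1] ] } 0 .. $#words - 1;
}

my @edges = text_to_edges("War is peace. Freedom is slavery.");
print "$_->[0] $_->[1]\n" for @edges;
# Prints seven lines:
#   war is
#   is peace
#   peace .
#   . freedom
#   freedom is
#   is slavery
#   slavery .
```

Each full stop is a vertex of its own, and the repeated word "is" yields a single vertex with several edges.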


Basic Network Statistics

These statistics include the number of nodes, the number of edges, the mean degree, and the exponent of the power-law degree distribution. All of them can be calculated with the Clairlib utility script print_network_stats.pl.

  print_network_stats.pl --input 1984.graph --all --force

The table below compares our results with those reported by Masucci and Rodgers in their paper.

Property                                 Paper Result   Clairlib Result
Number of nodes                          8992           8576
Number of edges                          117687         117196
Mean degree                              13.1           13.65
Degree power law distribution exponent   -2.1           -1.9
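
As a rough cross-check that does not depend on Clairlib, the node and edge counts can be recomputed from the edge list. The sketch below assumes that every line counts as one edge (repeats included) and that the mean degree is the number of edges divided by the number of nodes, which is consistent with the paper's figures (117687 / 8992 ≈ 13.1). The helper name edge_list_stats is hypothetical, and print_network_stats.pl may count differently.

```perl
use strict;
use warnings;

# Hypothetical helper: basic counts from "source target" edge lines.
sub edge_list_stats {
    my @lines = @_;
    my %nodes;
    my $edges = 0;
    for my $line (@lines) {
        my ($from, $to) = split ' ', $line;
        next unless defined $to;      # skip blank or malformed lines
        $nodes{$from} = 1;
        $nodes{$to}   = 1;
        $edges++;                     # every line counts, repeats included
    }
    my $n = keys %nodes;              # number of distinct vertices
    return ($n, $edges, $n ? $edges / $n : 0);
}

my @sample = ("war is", "is peace", "peace .", ". freedom",
              "freedom is", "is slavery", "slavery .");
my ($n, $e, $mean) = edge_list_stats(@sample);
printf "nodes=%d edges=%d mean degree=%.2f\n", $n, $e, $mean;
# Prints: nodes=6 edges=7 mean degree=1.17
```

For the real data, the lines of 1984.graph would be passed in place of the sample list.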

Calculate the reciprocity value

  use Clair::Network qw($verbose);
  use Clair::Network::Reader::Edgelist;
  $reader = Clair::Network::Reader::Edgelist->new();
  $delim = "[ \t]+";
  $filebased = 0;
  $fname = "1984.graph";
  $net = $reader->read_network($fname,
                                 delim => $delim,
                                 directed => 1,
                                 filebased => $filebased,
                                 edge_property => "lexrank_transition",
                                 multiedge => 1);
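  $n = $net->num_nodes();                   # N: number of vertices
  $l = scalar($net->get_edges());           # L: number of directed edges
  $density = $l/$n/($n-1);                  # a = L / (N(N-1)), the edge density
  $mutual = $net->get_mutual_edges_num();   # number of mutual edges
  $r = $mutual/$l;                          # raw reciprocity
  $ro = ($r - $density)/(1 - $density);     # reciprocity corrected for the density
  print "r=$r\na=$density\nro=$ro\n";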

The result, compared with the value reported in the paper:

  • Clairlib 0.021
  • Paper 0.0204
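
The same quantity can be worked through by hand on a toy graph. The sketch below computes r, the fraction of edges whose reverse edge is also present, and then corrects it for the density a, giving (r - a)/(1 - a). It assumes that repeated edges and self-loops are ignored and that both directions of a mutual pair are counted. The helper name reciprocity is hypothetical, and Clairlib's get_mutual_edges_num may follow a different convention.

```perl
use strict;
use warnings;

# Hypothetical helper: density-corrected reciprocity of a directed graph
# given as [from, to] pairs. Assumes node names contain no whitespace and
# that the graph is neither empty nor complete (to avoid dividing by zero).
sub reciprocity {
    my @pairs = @_;
    my (%edge, %node);
    for my $p (@pairs) {
        my ($u, $v) = @$p;
        next if $u eq $v;             # ignore self-loops
        $edge{"$u $v"} = 1;           # repeated edges collapse to one key
        $node{$u} = $node{$v} = 1;
    }
    my $n = keys %node;
    my $l = keys %edge;
    # Count the edges whose reverse edge is also present.
    my $mutual = grep {
        my ($u, $v) = split ' ', $_;
        exists $edge{"$v $u"};
    } keys %edge;
    my $r       = $mutual / $l;
    my $density = $l / ($n * ($n - 1));
    return ($r - $density) / (1 - $density);
}

# Four nodes, four edges, one mutual pair (a <-> b).
my $rho = reciprocity(["a", "b"], ["b", "a"], ["b", "c"], ["c", "d"]);
printf "rho = %.2f\n", $rho;
# Prints: rho = 0.25
```

Here r = 2/4 and a = 4/12, so the corrected value is (1/2 - 1/3)/(1 - 1/3) = 0.25. A value above zero means mutual edges are more common than the density alone would predict.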

Calculate degree growth

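  # @words is the token array built in the "Convert Text to Network" step.
  my %list = ();   # words seen so far
  my %hist = ();   # vocabulary size => latest 1-based position recorded for it
  my $count = 0;   # number of distinct words seen so far
  for (my $i = 0; $i < $#words; $i++)
  {
      if (exists $list{$words[$i]})
      {
          # Known word: update the position stored for the current count.
          $hist{$count} = $i+1;
      }
      else
      {
          # New word: store the position under the current count, then increment it.
          $hist{$count++} = $i+1;
          $list{$words[$i]} = 0;
      }
  }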
  my $reader = Clair::Network::Reader::Edgelist->new();
  my $delim = "[ \t]+";
  my $filebased = 0;
  my $net = $reader->read_network("1984.graph",
                                 delim => $delim,
                                 directed => 1,
                                 filebased => $filebased,
                                 edge_property => "lexrank_transition",
                                 multiedge => 1);
  my @fit = $net->linear_regression(\%hist, log => 1);
  print "$fit[0]\n";

The result, compared with the value reported in the paper:

  • Clairlib 1.53
  • Paper 1.8
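
The growth data behind this fit can also be inspected on their own. The sketch below records, for each new distinct word, the position in the text at which it first appears, so that the k-th entry is the time at which the vocabulary reached size k. The helper name first_occurrence_times and the sample tokens are assumptions for this example. It leaves out the log-log regression, which the script above delegates to Clairlib.

```perl
use strict;
use warnings;

# Hypothetical helper: 1-based positions at which each distinct word
# first appears in the token list.
sub first_occurrence_times {
    my @tokens = @_;
    my (%seen, @times);
    for my $i (0 .. $#tokens) {
        next if $seen{ $tokens[$i] }++;   # skip words already seen
        push @times, $i + 1;
    }
    return @times;
}

my @times = first_occurrence_times(qw(war is peace . freedom is slavery .));
# $times[$k] is the time at which the vocabulary reached size $k + 1.
print join(" ", @times), "\n";
# Prints: 1 2 3 4 5 7
```

Positions 6 and 8 are missing from the output because the tokens there ("is" and ".") had already been seen.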