Network Properties of Written Human Language (Masucci and Rodgers)
In this tutorial we use Clairlib to replicate the work of Masucci and Rodgers in their paper, Network Properties of Written Human Language [1], in which they investigate written human language within the framework of complex network theory. Specifically, they analyze the topology of Orwell’s novel 1984, focusing on the local properties of the network.
Network Properties
The properties that we will calculate for the network that represents the text are:
- Number of nodes: Each node corresponds to a word or a punctuation mark.
- Number of edges: Two nodes are linked by an edge if they are adjacent in the text.
- Reciprocity value: Which quantifies the non-random presence of mutual edges between pairs of vertices.
- Mean degree (word frequency): The average, over all words, of the number of edges a word has. The degree and the frequency of a word have the same meaning, and are equal, because every time a new word is added to the text it is the only vertex of the network to acquire an edge.
- Degree distribution: The probability distribution of words degrees over the whole network.
- Zipf's power law exponent: The exponent of the power law that relates a word's frequency of occurrence to its rank (where words are ranked by their frequency of occurrence).
- Growth exponent: The exponent of the growth in the number of distinct words with respect to time, where time is measured by the length of the text read so far.
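The rank used in Zipf's law can be made concrete with a small example. The following is a minimal sketch, not part of Clairlib: the sub name rank_frequency and the toy word list are hypothetical, and ties in frequency are broken alphabetically as an arbitrary choice.

```perl
use strict;
use warnings;

# Rank words by how often they occur: rank 1 is the most frequent word.
# Returns a list of [rank, word, frequency] triples.
sub rank_frequency {
    my @words = @_;
    my %freq;
    $freq{$_}++ for @words;
    # Most frequent first; alphabetical order breaks ties (an arbitrary choice).
    my @ranked = sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq;
    return map { [$_ + 1, $ranked[$_], $freq{$ranked[$_]}] } 0 .. $#ranked;
}

my @table = rank_frequency(qw(the cat the dog the cat));
printf "%d %s %d\n", @$_ for @table;
```

On real text, plotting frequency against rank on log-log axes gives the curve whose slope is the Zipf exponent.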
Corpus
The corpus used in this tutorial is the well-known novel, Nineteen Eighty-Four (abbreviated to 1984) by English author George Orwell. The novel is available in txt format here.
Convert Text to Network
The first step is to build a network from the text of the novel. We treat the text as a finite directed network in which the words are the vertices and two vertices are linked if they are neighbors. Punctuation marks are also treated as vertices.
To convert the text to a network, we follow these steps:
- Read in the file:
my $file = "1984.txt";
my $text = `cat $file`;
- Split the text into separate words and store them in an array:
$text =~ s/ \'/ /g;              # opening quote
$text =~ s/\' / /g;              # closing quote
$text =~ s/([^A-Za-z])/ $1 /g;   # isolate non-letters
my @words = split ' ', $text;    # skips leading blanks
- Normalize the words' case by converting each to lower case:
# Lower-case in place, because the later steps read @words.
@words = map { lc } @words;
- Build the network and write it to a file:
open(my $out, '>', '1984.graph') or die "Cannot write 1984.graph: $!";
# One line per pair of adjacent tokens: "source target".
for my $i (0 .. $#words - 1)
{
    print $out "$words[$i] $words[$i+1]\n";
}
close($out);
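To see what these steps produce, here is a minimal self-contained sketch that runs the same substitutions on a short sample sentence instead of the novel. The sub names tokenize and edges_from_tokens are hypothetical helpers introduced for illustration, and the sketch skips leading whitespace when splitting, which is an assumption on top of the steps above.

```perl
use strict;
use warnings;

# Tokenize as in the steps above: detach quote marks at word boundaries,
# pad every non-letter with spaces, split on whitespace, lower-case.
sub tokenize {
    my ($text) = @_;
    $text =~ s/ \'/ /g;
    $text =~ s/\' / /g;
    $text =~ s/([^A-Za-z])/ $1 /g;
    # split ' ' skips leading whitespace, so no empty first token appears.
    return map { lc } split ' ', $text;
}

# One "source target" string per pair of adjacent tokens.
sub edges_from_tokens {
    my @tokens = @_;
    return map { "$tokens[$_] $tokens[$_+1]" } 0 .. $#tokens - 1;
}

my @tokens = tokenize("The cat sat. The cat ran.");
my @edges  = edges_from_tokens(@tokens);
print "$_\n" for @edges;   # 7 lines, starting with "the cat"
```

The sample yields 8 tokens (the two full stops are tokens of their own) and therefore 7 edges. The pair "the cat" appears twice, which is why the network is later read as a multigraph.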
Basic Network Statistics
These statistics include the number of nodes, the number of edges, the mean degree, and the exponent of the power-law degree distribution. All of these can be calculated with the Clairlib utility script print_network_stats.pl.
print_network_stats.pl --input 1984.graph --all --force
The following table compares our results with those reported by Masucci and Rodgers in their paper.
| Property | Paper Result | Clairlib Result |
| Number of nodes | 8992 | 8576 |
| Number of edges | 117687 | 117196 |
| Mean degree | 13.1 | 13.65 |
| Degree power law distribution exponent | -2.1 | -1.9 |
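The node count, edge count and mean degree are tied together by simple counting, which can serve as a rough cross-check. The sketch below is a minimal illustration under two assumptions: every line of the edge list counts as one edge (repeated pairs included), and the mean degree is taken as edges per node. The sub name edge_list_stats is hypothetical, and the sketch makes no claim about how print_network_stats.pl computes its figures.

```perl
use strict;
use warnings;

# Count nodes, edges and edges per node from "source target" lines.
# Assumption: each line is one edge, so a repeated pair counts every time.
sub edge_list_stats {
    my @lines = @_;
    my %nodes;
    my $edges = 0;
    for my $line (@lines) {
        my ($u, $v) = split ' ', $line;
        next unless defined $v;        # skip blank or malformed lines
        $nodes{$u} = 1;
        $nodes{$v} = 1;
        $edges++;
    }
    my $n = scalar keys %nodes;
    return ($n, $edges, $n ? $edges / $n : 0);
}

my ($n, $e, $mean) = edge_list_stats(
    "the cat", "cat sat", "sat .", ". the", "the cat", "cat ran", "ran .",
);
print "nodes=$n edges=$e mean=$mean\n";   # nodes=5 edges=7 mean=1.4
```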
Calculate the reciprocity value
use Clair::Network qw($verbose);
use Clair::Network::Reader::Edgelist;
$reader = Clair::Network::Reader::Edgelist->new();
$delim = "[ \t]+";
$filebased = 0;
$fname = "1984.graph";
$net = $reader->read_network($fname,
delim => $delim,
directed => 1,
filebased => $filebased,
edge_property => "lexrank_transition",
multiedge => 1);
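$n = $net->num_nodes();                   # N: number of vertices
$l = scalar($net->get_edges());           # L: number of directed edges
$a = $l/$n/($n-1);                        # a: density, L / (N (N - 1))
$mutual = $net->get_mutual_edges_num();
$r = $mutual/$l;                          # r: share of edges that are reciprocated
$ro = ($r - $a)/(1-$a);                   # ro: reciprocity, r corrected for density
print "r=$r\na=$a\nro=$ro\n";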
The result, compared with the authors' result in the paper:
- Clairlib 0.021
- Paper 0.0204
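The same quantities, r = mutual/L, a = L/(N(N-1)) and ro = (r - a)/(1 - a), can be worked through by hand on a toy graph. The following is a minimal sketch without Clairlib. It assumes edges are distinct ordered pairs, ignores self-loops, and counts both directions of a reciprocated pair as mutual; whether get_mutual_edges_num() counts the same way is not checked here. The sub name reciprocity is hypothetical.

```perl
use strict;
use warnings;

# Reciprocity ro = (r - a) / (1 - a) for a list of [source, target] pairs.
# Assumptions: repeated pairs count once and self-loops are ignored.
sub reciprocity {
    my @pairs = @_;
    my (%has_edge, %nodes);
    for my $pair (@pairs) {
        my ($u, $v) = @$pair;
        $nodes{$u} = 1;
        $nodes{$v} = 1;
        $has_edge{"$u\t$v"} = 1 unless $u eq $v;
    }
    my $n = scalar keys %nodes;
    my $l = scalar keys %has_edge;
    return if $n < 2 || $l == 0;          # undefined for degenerate graphs

    # An edge u -> v is mutual when v -> u is also present.
    my $mutual = 0;
    for my $key (keys %has_edge) {
        my ($u, $v) = split /\t/, $key;
        $mutual++ if $has_edge{"$v\t$u"};
    }
    my $r       = $mutual / $l;
    my $density = $l / ($n * ($n - 1));
    return if $density == 1;              # complete graph: ro is undefined
    return ($r - $density) / (1 - $density);
}

# Toy graph: a <-> b, b -> c, c -> d. N = 4, L = 4, mutual = 2.
my $rho = reciprocity(['a', 'b'], ['b', 'a'], ['b', 'c'], ['c', 'd']);
print "ro=$rho\n";   # r = 0.5, a = 1/3, so ro = 0.25
```

A value above zero means mutual edges occur more often than the edge density alone would suggest.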
Calculate degree growth
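my %list = ();    # words seen so far
my %hist = ();    # vocabulary size before the current word => latest text position at that size
my $count = 0;    # number of distinct words seen so far
for my $i (0 .. $#words - 1)
{
    if (exists $list{$words[$i]})
    {
        # Known word: the text grows, the vocabulary does not.
        $hist{$count} = $i+1;
    }
    else
    {
        # New word: record the position, then grow the vocabulary.
        $hist{$count++} = $i+1;
        $list{$words[$i]} = 0;
    }
}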
my $reader = Clair::Network::Reader::Edgelist->new();
my $delim = "[ \t]+";
my $filebased = 0;
my $net = $reader->read_network("1984.graph",
delim => $delim,
directed => 1,
filebased => $filebased,
edge_property => "lexrank_transition",
multiedge => 1);
my @fit = $net->linear_regression(\%hist, log => 1);
print "$fit[0]\n";
The result, compared with the authors' result in the paper:
- Clairlib 1.53
- Paper 1.8
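The fit above amounts to a straight-line fit on log-log axes. The following is a minimal sketch of that idea without Clairlib, assuming the exponent is the least-squares slope of log(text position) against log(vocabulary size). The sub names growth_points and loglog_slope are hypothetical, and the sketch records one point per new word, which differs slightly from the %hist bookkeeping above.

```perl
use strict;
use warnings;

# For each word seen for the first time, record
# [vocabulary size after adding it, its 1-based position in the text].
sub growth_points {
    my @words = @_;
    my (%seen, @points);
    for my $i (0 .. $#words) {
        next if $seen{$words[$i]}++;      # already known: no new point
        push @points, [scalar(keys %seen), $i + 1];
    }
    return \@points;
}

# Least-squares slope of log(y) against log(x); needs x > 0 and y > 0.
sub loglog_slope {
    my ($points) = @_;
    my ($n, $sx, $sy, $sxx, $sxy) = (0, 0, 0, 0, 0);
    for my $p (@$points) {
        my $x = log($p->[0]);
        my $y = log($p->[1]);
        $n++;
        $sx  += $x;
        $sy  += $y;
        $sxx += $x * $x;
        $sxy += $x * $y;
    }
    my $den = $n * $sxx - $sx * $sx;
    return unless $den;                   # fewer than two distinct x values
    return ($n * $sxy - $sx * $sy) / $den;
}

# Sanity check on synthetic data: y = 3 * x^2 has slope 2 on log-log axes.
my @synthetic = map { [$_, 3 * $_ ** 2] } 1 .. 10;
my $slope = loglog_slope(\@synthetic);
printf "slope=%.3f\n", $slope;

my $points = growth_points(qw(a b a c));  # [1,1], [2,2], [3,4]
```

With the tokens of the novel, loglog_slope(growth_points(@words)) would give a comparable exponent, though not necessarily the same value as linear_regression.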

