Patterns in syntactic dependency networks
From CLAIRlib
In this tutorial we use Clairlib to replicate the work done by Cancho et. al. in their papers, "Patterns in Syntactic Dependency Networks"[1] and "Universality in syntactic dependency networks"[2], in which study the architecture of syntactic graphs and show that they display small world patterns, scale-free structure, a well-defined hierarchical organization, and assortative mixing. Practically, three different European languages will be used: Czech, Romanian, and German.
Contents |
Data Set
There were three data sets used in the paper, Czech, Romanian, and German corpora. The Romanian is the only one freely available. It is available at the Dependency Grammar Annotator (DGA) website here. In this tutorial we will deal only with the Romanian corpus. The README file that comes with the corpus gives details about the format of the XML files.
Network Properties
Using the Romanian corpus, we will build a Syntactic Dependency Network. This network has its nodes correspond to words and its edges correspond to binary relations (syntactic dependencies) between words. Most of the edges are directed and the arc goes from the head word to its modifier or vice versa depending on the convention used. Head and modifier are primitive concepts in the dependency grammar formalism.
After building the network we will calculate some of the properties of this network. These properties are:
- Number of nodes: Where each node corresponds to a word.
- Number of edges: Where each edge represents a syntactic dependency between a head word and a modifier.
- Clustering coefficient: The probability that two nodes(words) that are neighbors of a given node are neighbors of each other.
- Diameter: Average minimum distance between pairs of nodes.
- Power law exponent: The exponent of the power law distribution
- Size of giant component
- Average shortest path
Convert Corpus to Network
The first step is to parse the corpus and build a syntactic network from it.
- Read in the corpus XML files:
my @files = glob($xml_dir . "/t*.xml");
Where $xml_dir is the path where the corpus XML files are located. All the xml files has are named t*.xml (e.g. t10.xml and tp7.xml).
- Parse the XML files and build the network
use XML::Twig;
my $twig = new XML::Twig();
my %all_words = ();
my $num_sentences = 0;
my $num_words = 0;
my $total_words = 0;
my $word_limit = 100000;
foreach my $file (@files) {
$twig->parsefile($file) or die "Couldn't parse $file\n";
my $root = $twig->root;
# loop through the sentences in this document
my @sentences = $root->children;
$num_words = 0;
foreach my $sentence (@sentences) {
# loop through the tokens
# build up an associative array of the word and head, indexed by position
# which is then used to resolve the dependencies
my %tok_table = ();
my @tokens = $sentence->children;
$num_sentences++;
$num_words += scalar(@tokens);
# skip sentences with less than two words
if ((scalar(@tokens) >= 2) && ($num_words < $word_limit)) {
foreach my $token (@tokens) {
my $word = $token->field("orth");
my $ctag = $token->field("ctag");
$word = lc($word);
$all_words{$word}++;
my $ordno = $token->field("ordno");
my $syn = $token->first_child("syn");
my $head = $syn->field("head");
$tok_table{$ordno}{"word"} = $word;
$tok_table{$ordno}{"head"} = $head;
}
# Add the edges to the network
foreach my $key (keys %tok_table) {
my $word = $tok_table{$key}{"word"};
my $head = $tok_table{$key}{"head"};
if (not defined $tok_table{$head}) {
} else {
my $head_word = $tok_table{$head}{"word"};
$lex_network->add_edge($word, $head_word)
}
}
}
}
$total_words += $num_words;
}
The full code that reads in the files and builds the network is available here. You can use this script by running this command
parse_dg.pl -o romanian-newspaper-dga.graph
This will create a network and write it to a file, romanian-newspaper-dga.graph.
Network Statistics
These statistics include number of nodes, number of edges, clustering coefficient, diameter, power law distribution exponent, size of giant component, and average shortest path. All these can be calculated using the clairlib utility script, print_network_stats.pl.
print_network_stats.pl -i romanian-newspaper-dga.graph --force
Following are the results that we got versus the results got by Cancho et. al. in their paper.
| Property | Paper Result | Clairlib Result |
| Number of nodes | 5563 | 5598 |
| Number of edges | - | 14516 |
| Average Degree | 5.1 | 5.12 |
| size of giant component | 5563 | 5594 |
| Power law exponent | -2.19 | -2.19 |
| Clustering Coefficient | 0.09 | 0.09 |
| Average shortest path | 3.4 | 3.49 |

