AAN Tutorial
One of the useful corpora recently added to the Clairlib distribution is AAN, the ACL Anthology Network. AAN is a collection of scientific publications in the NLP area, gathered from many ACL venues. The network is currently built from 13706 ACL papers: all papers published up to and including November 2008 that were successfully processed. From those papers, we have created several networks.
AAN Networks
- Paper citation
A directed network in which each node corresponds to a paper and each edge represents a reference from one paper to another.
- Paper citation network without self citations
The same as above but edges that represent self citations are dropped.
- Author citation
A directed network in which each node corresponds to an author and each edge represents a reference from one author to another.
- Author citation network without self citations
The same as Author citation, but edges that represent self citations (where an author references their own work) are dropped.
- Authors collaboration
An undirected network where nodes represent authors and an edge connects two authors who have coauthored a paper.
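The relationship between the paper-level and author-level networks described above can be illustrated with a minimal Python sketch. It is not the Clairlib implementation: the function name `build_author_networks`, the dictionary layout, and the toy paper IDs and author names are all assumptions made for illustration.

```python
from itertools import combinations


def build_author_networks(paper_authors, paper_citations, drop_self=False):
    """Derive author-level edges from paper-level data (illustrative only).

    paper_authors: dict mapping a paper id to its list of author names
                   (hypothetical layout, not the AAN file format).
    paper_citations: iterable of (citing_paper, cited_paper) pairs.
    drop_self: if True, skip edges where an author cites themselves.
    """
    # Directed author citation edges: every author of the citing paper
    # references every author of the cited paper.
    author_citations = set()
    for citing, cited in paper_citations:
        for source in paper_authors.get(citing, []):
            for target in paper_authors.get(cited, []):
                if drop_self and source == target:
                    continue
                author_citations.add((source, target))

    # Undirected collaboration edges: one edge per pair of coauthors,
    # stored with the names in sorted order so each pair appears once.
    collaborations = set()
    for authors in paper_authors.values():
        for a, b in combinations(sorted(set(authors)), 2):
            collaborations.add((a, b))

    return author_citations, collaborations


# Toy data: paper P2 cites paper P1, and "Bob" is an author of both.
toy_authors = {"P1": ["Ann", "Bob"], "P2": ["Bob", "Cy"]}
toy_citations = [("P2", "P1")]
citations, collaborations = build_author_networks(
    toy_authors, toy_citations, drop_self=True
)
print(sorted(citations))
print(sorted(collaborations))
```

With `drop_self=True`, the edge from "Bob" to "Bob" is omitted, which mirrors the author citation network without self citations.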
AAN data files and some useful scripts to process them have been added to Clairlib. All the AAN-related files can be found under $PATH-TO-CLAIRLIB/aan/. A full description of the data and statistics files, and the usage of the scripts, can be found in the release.README file in the aan directory. In this chapter, we will show how to use the scripts to generate AAN networks from the data files and then gather useful statistics and information from them.
Generating AAN Networks
This distribution of AAN includes two data files, acl-metadata.txt and acl.txt, from which we can create the different networks and statistics using in-house scripts. In this section we show you how to use these scripts to generate the AAN networks mentioned above.
- Generate Paper Citation Network
bin/aan_make_paper_citations.pl
- Generate Paper Citation Network excluding self citations
bin/aan_make_paper_citations.pl --nonself
- Generate Author Citation Network
bin/aan_make_author_citation.pl
- Generate Author Citation Network excluding self citations
bin/aan_make_author_citation.pl -nonself
- Generate Author Collaboration Network
bin/aan_make_author_collaboration.pl
All networks generated above are formatted using the Edgelist format, which lists a single edge per line. An edge is formatted as Node1_label ==> Node2_label
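For readers who want to load such a file outside Clairlib, a minimal Python sketch of an Edgelist reader follows. It assumes the delimiter between the two labels is the literal string `==>`; the function name `read_edgelist` and the sample labels are hypothetical.

```python
def read_edgelist(lines, delimiter="==>"):
    """Parse 'source ==> target' lines into a dict of successor sets.

    The delimiter is an assumption about the file layout; pass a
    different one if your files use another separator.
    """
    adjacency = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, found, target = line.partition(delimiter)
        if not found:
            raise ValueError(f"no delimiter in line: {line!r}")
        source, target = source.strip(), target.strip()
        adjacency.setdefault(source, set()).add(target)
        # Make sure nodes that only receive edges are present too.
        adjacency.setdefault(target, set())
    return adjacency


# Hypothetical labels, used only to show the call.
sample = ["A ==> B", "B ==> C", "A ==> C"]
graph = read_edgelist(sample)
print({node: sorted(targets) for node, targets in graph.items()})
```

To read a real file, pass an open file handle, for example `read_edgelist(open(path))`, where `path` points to one of the generated network files.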
Basic Statistics
The main script for generating statistics for any of the networks mentioned above is aan_network_stats.pl, which has the following format:
aan_network_stats.pl -input=acit|acoll|pcit [--delimout=output_delimiter] [-output=output_file] [-pajek=pajek_file] [-stats] [-graphml=graphml_file] [-sample=sample_size] [-sampletype=sample_type] [-extract] [-components] [-undirected] [-paths] [-wcc] [-cc] [-scc] [-triangles] [-assortativity] [-verbose] [-localcc] [-all] [-betweenness-centrality] [-degree-centrality] [-closeness-centrality] [-lexrank-centrality] [-force] [-graph-class=graph_class] [-filebased] [-help]
Some examples of how to use this script are:
- To generate the basic statistics of the author citation network:
aan_network_stats.pl -input="acit" --stats
- To generate the statistics of the paper citation network and output the result to a file in Pajek compatible format:
aan_network_stats.pl -input="pcit" --pajek "pajekfile"
- To generate the statistics of the author collaboration network while treating the network as undirected:
aan_network_stats.pl -input="acoll" --undirected
- To generate the betweenness, degree, and closeness centrality scores for every author based on the author citation network:
aan_network_stats.pl --input="acit" --degree-centrality --betweenness-centrality --closeness-centrality
- To generate all statistics for a sample of size 100 drawn from the author citation network using the randomnode sample type:
aan_network_stats.pl -input="acit" --sample 100 --sampletype randomnode -all
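To give an intuition for what sampling a network before computing statistics involves, here is a minimal Python sketch. It shows one plausible reading of random node sampling (pick nodes uniformly at random and keep only the edges between them); how Clairlib's randomnode sample type actually works may differ, and the function names `random_node_sample` and `basic_stats` are hypothetical.

```python
import random


def random_node_sample(edges, sample_size, seed=None):
    """Keep only the edges whose two endpoints are both in a random node set.

    Assumption: nodes are drawn uniformly without replacement. Sampled
    nodes that end up with no edges do not appear in the result.
    """
    rng = random.Random(seed)
    nodes = sorted({node for edge in edges for node in edge})
    kept = set(rng.sample(nodes, min(sample_size, len(nodes))))
    return [(u, v) for u, v in edges if u in kept and v in kept]


def basic_stats(edges):
    """Return node count, distinct edge count, and average out-degree."""
    nodes = {node for edge in edges for node in edge}
    n_edges = len(set(edges))
    avg_out = n_edges / len(nodes) if nodes else 0.0
    return {"nodes": len(nodes), "edges": n_edges, "avg_out_degree": avg_out}


# Toy directed network with hypothetical labels.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
print(basic_stats(toy_edges))
print(basic_stats(random_node_sample(toy_edges, 2, seed=0)))
```

Statistics computed on a sample are estimates, so repeating the draw with different seeds gives a sense of how much they vary.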
You can also count the number of citations and collaborations for authors and papers. The scripts that help with this include aan_author_citations.pl and aan_author_collaborations.pl. Some examples of how to use them are:
- To get the number of all citations for every author, provided that they are older than the year 2005
aan_author_citations.pl -year 2005
- To get the number of incoming citations for every paper excluding self citations
aan_author_citations.pl -incites -nonself
- To get the number of collaborations for every author
aan_author_collaborations.pl
PageRank Scores
You can get the PageRank scores for papers or authors using the aan_pageranks.pl script. For example:
- To get the PageRank scores of every paper
aan_pageranks.pl -input="pcit"
- To get the PageRank scores of every author
aan_pageranks.pl -input="acit"
H-index
You can also get the H-index for every author using the aan_hindex.pl script. For example:
- To get the H-index for every author after excluding self citations
aan_hindex.pl -nonself
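For reference, an author has an H-index of h when h of their papers have each received at least h citations. The Python sketch below computes that value from a list of per-paper citation counts; it illustrates the standard definition and is not taken from aan_hindex.pl. The function name `h_index` and the sample counts are made up for the example.

```python
def h_index(citation_counts):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for position, count in enumerate(ranked, start=1):
        if count >= position:
            h = position
        else:
            break  # counts only get smaller from here
    return h


# Made-up citation counts for one author's five papers.
print(h_index([10, 8, 5, 4, 3]))
```

To exclude self citations, the counts passed in would first have to be computed without edges where an author cites their own work.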
More Information
- For more information about AAN please visit: http://clair.si.umich.edu/clair/anthology/
- For detailed information about the scripts and usage instructions, see the release.README file located in the AAN path in Clairlib.