Corpora
From CLAIRlib
The following corpora are used in the documentation to showcase some of the functionality of clairlib.
The currently available corpora are:
A well-known IR test collection containing 1,400 aerodynamics documents.
Definitions of the chemical elements, taken from Wikipedia.
A set of Romanian texts (newspaper articles) annotated according to the Dependency Grammar formalism.
A classic dystopian novel by the English author George Orwell, published in 1949.
- 11sent (txt)
Eleven news statements about the war on Iraq.
Subset of data on the social relations among Renaissance Florentine families collected by John Padgett from historical documents.
A network of friendships between the 34 members of a karate club at a US university, as described by Wayne Zachary in 1977. In GML format.
- Adjnoun (gml)
Adjective-noun adjacency network for Charles Dickens's David Copperfield.
- emma.txt (txt)
Jane Austen - Emma
- Languages (tar.gz)
Pajek word-adjacency networks for English, French, Spanish, and Japanese
- One_vs_two (website)
Handwritten digits, from Prof. Xiaojin Zhu's PhD thesis.
- 10classes (website)
Handwritten digits, from Prof. Zhu: 4,000 digits in 10 classes (0-9).
- samplepp (tar.gz)
Prepositional-phrase attachment (ppattach) data: phrases such as "eat salad with fork", each with either verb attachment ("eat with fork") or noun attachment ("salad with fork").
- adj-conj-adj (tar.gz)
Examples of the "ADJ CONJ ADJ" pattern from the WSJ corpus, covering the conjunctions "and", "but", and "or".
- Milan.txt (txt)
A collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
- Flavors (txt)
A simple example graph in edgelist format used for graph partitioning.
- Signed Network (net)
A simple example graph used for graph partitioning. In Pajek format.
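
Several of the network corpora above are distributed in Pajek format. The following is a minimal Python sketch of reading such a file, included only to illustrate the layout; it is not Clairlib code. The function name parse_pajek and the sample data are hypothetical, and the sketch assumes a file with only a *Vertices section followed by *Edges or *Arcs sections, with an optional numeric weight as the third field of each edge line.

```python
import shlex


def parse_pajek(text):
    """Parse a minimal Pajek .net document (assumed layout, see above).

    Returns (labels, edges): labels maps vertex id -> label, and edges is
    a list of (source, target, weight) tuples.
    """
    labels = {}
    edges = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comment lines
        if line.startswith("*"):
            # Section header such as "*Vertices 3", "*Edges" or "*Arcs"
            section = line.split()[0].lower()
            continue
        # shlex keeps quoted labels such as "a b" together as one field
        fields = shlex.split(line)
        if section == "*vertices":
            vid = int(fields[0])
            labels[vid] = fields[1] if len(fields) > 1 else str(vid)
        elif section in ("*edges", "*arcs"):
            # A missing weight defaults to 1.0 (an assumption of this sketch)
            weight = float(fields[2]) if len(fields) > 2 else 1.0
            edges.append((int(fields[0]), int(fields[1]), weight))
    return labels, edges


# Hypothetical sample data: a tiny signed network with one negative edge
SAMPLE = '''*Vertices 3
1 "a"
2 "b"
3 "c"
*Edges
1 2 1
2 3 -1
'''

labels, edges = parse_pajek(SAMPLE)
print(labels)
print(edges)
```

In a signed network the sign of the weight distinguishes positive from negative ties, which is why the sketch keeps the weight as a number instead of discarding it. Real Pajek files may contain further sections (for example *Arcslist) or labels with unbalanced quotes, which this sketch does not handle.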

