The following corpora are used in the documentation to showcase some of the functionality of clairlib.

The currently available corpora are:

A well-known IR test collection containing 1,400 aerodynamics documents.

Definitions of the chemical elements, taken from Wikipedia.

A set of Romanian texts (newspaper articles) annotated according to the Dependency Grammar formalism.

A classic dystopian novel by the English author George Orwell, published in 1949.

Eleven news statements about the war on Iraq.

  • Padgett Florentine Families Network (tar) (website)

A subset of the data on social relations among Renaissance Florentine families, collected by John Padgett from historical documents.

A network of friendships between the 34 members of a karate club at a US university, as described by Wayne Zachary in 1977. In GML format.

  • Adjnoun (gml)

Adjective-noun adjacency network for Charles Dickens's novel David Copperfield.

  • emma.txt (txt)

The novel Emma by Jane Austen.

Pajek word-adjacency networks for English, French, Spanish, and Japanese.

Handwritten digits, from Prof. Xiaojin Zhu's PhD thesis.

Handwritten digits from Prof. Zhu: 4,000 digits in 10 classes (0-9).

Prepositional phrase attachment (ppattach) data: phrases such as "eat salad with fork", labeled with either noun attachment ("salad with fork") or verb attachment ("eat with fork").

Examples of the "ADJ CONJ ADJ" pattern from the WSJ corpus, covering the conjunctions "and", "but", and "or".

  • Milan.txt (txt)

A collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

  • Flavors (txt)

A simple example graph in edgelist format used for graph partitioning.

  • Signed Network (net)

A simple example graph used for graph partitioning. In Pajek format.
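
The two graph-partitioning examples above are distributed as an edge list and as a Pajek file. The following is a minimal Python sketch of how such files might be read. It is not part of clairlib, and it rests on assumptions about the layout: edge list lines of the form `u v` or `u v weight` with `#` comments, and a Pajek file with a `*Vertices` section followed by `*Edges` or `*Arcs`. The function names `parse_edgelist` and `parse_pajek` are hypothetical, and the actual layout of the Flavors and Signed Network files may differ.

```python
import shlex


def parse_edgelist(text):
    """Parse edge list lines of the assumed form 'u v' or 'u v weight'."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comments (the comment marker is an assumption).
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        # Default to weight 1.0 when no third column is present.
        weight = float(parts[2]) if len(parts) > 2 else 1.0
        edges.append((parts[0], parts[1], weight))
    return edges


def parse_pajek(text):
    """Parse the *Vertices and *Edges/*Arcs sections of a Pajek-style file."""
    labels = {}
    edges = []
    section = None
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '%' comment lines.
        if not line or line.startswith("%"):
            continue
        # A line starting with '*' opens a new section, e.g. '*Vertices 3'.
        if line.startswith("*"):
            section = line.split()[0].lower()
            continue
        # shlex keeps quoted vertex labels such as "Node A" together.
        parts = shlex.split(line)
        if section == "*vertices":
            labels[parts[0]] = parts[1] if len(parts) > 1 else parts[0]
        elif section in ("*edges", "*arcs") and len(parts) >= 2:
            # In a signed network the weight column may be negative.
            weight = float(parts[2]) if len(parts) > 2 else 1.0
            edges.append((parts[0], parts[1], weight))
    return labels, edges
```

Both functions take the file contents as a string, so a file would first be read with something like `open(path).read()`. Keeping vertex identifiers as strings avoids assuming that the identifiers in the edge list are numeric.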

