Corpora


The following corpora are used in the documentation to showcase some of Clairlib's functionality.

The currently available corpora are:


A well-known IR test collection containing 1,400 aerodynamics documents.


Definitions of the chemical elements, taken from Wikipedia.


A set of Romanian texts (newspaper articles) annotated according to the Dependency Grammar formalism.


A classic dystopian novel by the English author George Orwell, published in 1949.


Eleven news statements about the war on Iraq.


  • Padgett Florentine Families Network (tar) (website)

A subset of the data on social relations among Renaissance Florentine families, collected by John Padgett from historical documents.


A network of friendships between the 34 members of a karate club at a US university, as described by Wayne Zachary in 1977. The network is in GML format.


  • Adjnoun (gml)

Adjective-noun adjacency network of Charles Dickens's novel David Copperfield.


  • emma.txt (txt)

The novel Emma by Jane Austen.


Pajek word-adjacency networks for English, French, Spanish, and Japanese.


Handwritten digits, from Prof. Xiaojin Zhu's PhD thesis.


Handwritten digits, from Prof. Zhu: 4,000 digits in 10 classes (0-9).


Prepositional phrase attachment (ppattach) data: phrases such as "eat salad with fork", labeled with either verb attachment ("eat with fork") or noun attachment ("salad with fork").


Examples of "ADJ CONJ ADJ" patterns from the WSJ corpus, with the conjunctions "and", "but", and "or".


  • Milan.txt (txt)


A collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.


  • Flavors (txt)

A simple example graph in edgelist format, used for graph partitioning.


  • Signed Network (net)

A simple example graph in Pajek format, used for graph partitioning.
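
For readers unfamiliar with the edgelist format mentioned for the Flavors graph, the following is a minimal Python sketch of reading such a file into an adjacency map. It assumes one edge per line, written as two whitespace-separated node names, with any further columns ignored and edges treated as undirected; the actual layout of the Flavors file is not described on this page and may differ. The function name `read_edgelist` and the sample data are illustrative only and are not part of Clairlib.

```python
from collections import defaultdict


def read_edgelist(lines):
    """Build an undirected adjacency map from lines of the form 'u v'.

    Assumption: each edge is two whitespace-separated node names;
    extra columns (for example, weights) are ignored.
    """
    adjacency = defaultdict(set)
    for line in lines:
        parts = line.split()
        # Skip blank lines, malformed lines, and '#' comment lines.
        if len(parts) < 2 or parts[0].startswith("#"):
            continue
        u, v = parts[0], parts[1]
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


# Hypothetical sample data, not taken from any Clairlib corpus.
sample = ["a b", "b c", "", "# comment", "c a 1.0"]
graph = read_edgelist(sample)
```

To read a real file, the same function could be called on an open file handle, for example `read_edgelist(open(path))`, where `path` points to a downloaded edgelist file.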
