Generate a URL-Based Network out of a Hyperlinked Dataset


In this tutorial we will use some Clairlib utilities to generate a URL-based network from a hyperlinked dataset without indexing it, since indexing a large dataset takes a long time. We will show two ways to do this: by creating a Clairlib corpus first, and without creating a corpus.


By creating a Clairlib corpus first

This way involves three steps:

  • Create a corpus out of the dataset
  • Generate the links database of the corpus
  • Create the URL-based network

Create a Corpus out of the dataset

For the purpose of this tutorial, we'll build a corpus by downloading the dataset files from the internet. We will use the chemical elements dataset as an example.

download_urls.pl -c chemical -i http://belobog.si.umich.edu/clair/corpora/chemical -b produced

If you have the files already downloaded and stored in some directory on your machine, you can use

directory_to_corpus.pl --corpus chemical --directory source --base produced --type html

where "source" is the directory where the dataset is located.

Generate the links database of the corpus

To do this, we'll use the index_corpus.pl utility. However, we'll pass some parameters that instruct it to skip the indexing part and build only the files needed for the next step (i.e. building the URL network of the corpus):

index_corpus.pl --corpus chemical --base produced --notf --noidf --nostats

The --notf, --noidf, and --nostats arguments tell index_corpus.pl to skip the indexing steps.

Create the URL-Based Network

To create the URL network, we'll use the corpus_to_network.pl utility as follows:

corpus_to_network.pl -c chemical -b produced -o chemical.graph

where chemical.graph is the name of the resulting graph file.
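
For convenience, the three steps above could be chained in a small shell script. The sketch below is only an illustration: it reuses the commands exactly as they appear in this section (with the directory_to_corpus.pl variant for files already stored locally), while the variable names and the step and pipeline helpers are our own and not part of Clairlib. By default it only prints the commands; setting RUN=1 executes them, which assumes the Clairlib scripts are on your PATH.

```shell
#!/bin/sh
# Sketch: chain the three corpus-based steps from this tutorial.
# CORPUS, BASE, SRC and OUT are illustrative shell variables, not Clairlib settings.
CORPUS=chemical
BASE=produced
SRC=source
OUT=chemical.graph

# Print a command; execute it only when RUN=1 is set in the environment.
step() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@" || exit 1
    fi
}

# Run (or preview) the three steps in order.
pipeline() {
    step directory_to_corpus.pl --corpus "$CORPUS" --directory "$SRC" --base "$BASE" --type html
    step index_corpus.pl --corpus "$CORPUS" --base "$BASE" --notf --noidf --nostats
    step corpus_to_network.pl -c "$CORPUS" -b "$BASE" -o "$OUT"
}

pipeline
```

Saved under a name such as build_network.sh (a hypothetical file name), "sh build_network.sh" previews the commands and "RUN=1 sh build_network.sh" runs them, stopping at the first step that fails.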

Without creating a corpus

To do this, let's download the chemical files and store them in a directory "chemical_src":

mkdir chemical_src
cd chemical_src
# -r: recursive download; -nd: do not create directories; -nc: skip files that already exist
wget -r -nd -nc http://belobog.si.umich.edu/clair/corpora/chemical
cd ..

Then, we will use the directory_to_URL_network.pl utility as follows:

directory_to_URL_network.pl --directory chemical_src --output chemical.graph
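
After either approach, it may be worth confirming that the graph file was actually written. The following is a minimal sketch: check_graph is a hypothetical helper, not a Clairlib utility, and it makes no assumption about the layout of the graph file beyond it being a non-empty text file.

```shell
#!/bin/sh
# Sketch: confirm that a network file was written and is not empty.
# check_graph is a hypothetical helper, not part of Clairlib.
check_graph() {
    if [ ! -s "$1" ]; then
        echo "missing or empty: $1" >&2
        return 1
    fi
    # Report the line count only; the file layout is not assumed here.
    echo "$1: $(wc -l < "$1" | tr -d ' ') lines"
}
```

For example, "check_graph chemical.graph" prints the line count of the file, or returns a non-zero status if the file is missing or empty.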