Generate a URL-Based Network out of a Hyperlinked Dataset


In this tutorial we will use some Clairlib utils to generate a URL-based network out of a hyperlinked dataset without indexing it, since indexing takes a long time on large datasets. We will show two ways to do this:


By creating a Clairlib corpus first

This approach involves three steps:

  • Create a corpus out of the dataset
  • Generate the links database of the corpus
  • Create the URL-based network

Create a Corpus out of the dataset

For this tutorial we'll build a corpus by downloading the dataset files from the Internet, using the chemical elements dataset as an example:

-c chemical -i -b produced

If the files are already downloaded and stored in a directory on your machine, you can instead use:

--corpus chemical --directory source --base produced --type html

where "source" is the directory where the dataset is located.

Generate the links database of the corpus

To do this, we'll use the util, but pass some parameters that instruct it to skip the indexing part and only build the files needed for the next step (i.e. building the URL network of the corpus):

--corpus chemical --base produced --notf --noidf --nostats

The --notf, --noidf, and --nostats arguments tell the util to skip the indexing steps.

Create the URL-Based Network

To create the URL network, we'll use the util as follows:

-c chemical -b produced -o chemical.graph

where chemical.graph is the name of the resulting graph file.
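
As a rough illustration of what a URL-based network is, the Python sketch below turns a table of links into an edge list and writes it to a file. It is not Clairlib code: the function names links_to_edges and write_edge_list are hypothetical, the in-memory dictionary only stands in for the links database, and the one-edge-per-line "source target" layout is an assumption rather than the documented format of chemical.graph.

```python
from typing import Dict, Iterable, List, Tuple


def links_to_edges(links: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Turn a {source_url: [target_url, ...]} table into a sorted edge list.

    Duplicate links collapse into one edge, and self-links are dropped
    (both are choices made for this sketch, not Clairlib behaviour).
    """
    edges = set()
    for source, targets in links.items():
        for target in targets:
            if target != source:
                edges.add((source, target))
    return sorted(edges)


def write_edge_list(edges: Iterable[Tuple[str, str]], path: str) -> None:
    """Write one 'source target' pair per line (assumed layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for source, target in edges:
            handle.write(f"{source} {target}\n")
```

Each node in the resulting network is a URL, and each directed edge means that the first page links to the second.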

Without creating a corpus

To do this, let's download the chemical files and store them in a directory named "chemical_src":

mkdir chemical_src
cd chemical_src
wget -r -nd -nc
cd ..

Then, we will use the utility as follows:

--directory chemical_src --output chemical.graph
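
To make the idea of going straight from a directory of HTML files to a network more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the Clairlib utility: LinkCollector and directory_to_edges are hypothetical names, and base_url is an assumed parameter standing for the address the files were downloaded from, which the sketch needs in order to turn file names and relative links into full URLs.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def directory_to_edges(directory, base_url):
    """Return sorted (source_url, target_url) pairs for the HTML files in directory."""
    edges = set()
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith((".html", ".htm")):
            continue
        # Assumption: each file was fetched from base_url + its file name.
        source = urljoin(base_url, name)
        collector = LinkCollector()
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8", errors="replace") as handle:
            collector.feed(handle.read())
        for href in collector.hrefs:
            # Resolve relative links and strip "#fragment" parts.
            target, _fragment = urldefrag(urljoin(source, href))
            # Keep web links only; skip links from a page to itself.
            if target.startswith(("http://", "https://")) and target != source:
                edges.add((source, target))
    return sorted(edges)
```

For example, directory_to_edges("chemical_src", base_url) would return the hyperlink pairs found in the downloaded files, which could then be written out one edge per line.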