Generate a URL-Based Network out of a Hyperlinked Dataset
In this tutorial, we will use some Clairlib utilities to generate a URL-based network from a hyperlinked dataset without indexing it, since indexing a large dataset takes a long time. We will show two ways to do this:
By creating a Clairlib corpus first
This approach involves three steps:
- Create a corpus out of the dataset
- Generate the links database of the corpus
- Create the URL-based network
Create a Corpus out of the dataset
For the purpose of this tutorial, we'll build a corpus by downloading the dataset files from the Internet. We will use the chemical elements dataset as an example.
download_urls.pl -c chemical -i http://belobog.si.umich.edu/clair/corpora/chemical -b produced
If you have already downloaded the files to a directory on your machine, you can use the following instead:
directory_to_corpus.pl --corpus chemical --directory source --base produced --type html
where "source" is the directory where the dataset is located.
Generate the Links Database of the Corpus
To generate the links database, we'll use the index_corpus.pl util. However, we'll pass some parameters that instruct it to skip the indexing part and only build the files needed for the next step (i.e. building the URL network of the corpus):
index_corpus.pl --corpus chemical --base produced --notf --noidf --nostats
The --notf, --noidf, and --nostats arguments tell index_corpus.pl to skip the indexing steps.
Create the URL-Based Network
To create the URL network, we'll use the corpus_to_network.pl util as follows:
corpus_to_network.pl -c chemical -b produced -o chemical.graph
where chemical.graph is the name of the resulting graph file.
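The three steps above can be chained in a small shell script so that a failure in one step stops the rest. The following is only a sketch, assuming the Clairlib utilities are on your PATH; the run and build_network_via_corpus functions and the DRY_RUN variable are hypothetical names introduced for this example and are not part of Clairlib. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: run the three corpus-based steps in order, stopping at the first failure.
# Assumption: the Clairlib utilities are on PATH.
# run, build_network_via_corpus, and DRY_RUN are hypothetical names for this example.

# Print the command when DRY_RUN=1 (the default here); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_network_via_corpus() {
    corpus=$1   # corpus name, e.g. chemical
    url=$2      # URL the dataset is downloaded from
    base=$3     # base directory for the produced files
    graph=$4    # name of the resulting graph file

    # Step 1: create the corpus. Step 2: build the links database without indexing.
    # Step 3: create the URL-based network.
    run download_urls.pl -c "$corpus" -i "$url" -b "$base" &&
    run index_corpus.pl --corpus "$corpus" --base "$base" --notf --noidf --nostats &&
    run corpus_to_network.pl -c "$corpus" -b "$base" -o "$graph"
}
```

Calling build_network_via_corpus chemical http://belobog.si.umich.edu/clair/corpora/chemical produced chemical.graph prints the three commands; setting DRY_RUN=0 beforehand executes them instead.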
Without creating a corpus
To do this, let's download the chemical files and store them in a directory named "chemical_src":
mkdir chemical_src
cd chemical_src
wget -r -nd -nc http://belobog.si.umich.edu/clair/corpora/chemical
cd ..
Then, we will use the directory_to_URL_network.pl utility as follows:
directory_to_URL_network.pl --directory chemical_src --output chemical.graph
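After either approach, it can be useful to confirm that a graph file was actually written before using it. The following is a minimal sketch; check_graph is a hypothetical helper introduced here, not a Clairlib utility, and it assumes nothing about the graph file's format beyond it being a non-empty text file.

```shell
#!/bin/sh
# Sketch: confirm that a graph file was written and is not empty.
# check_graph is a hypothetical helper, not part of Clairlib; it makes no
# assumption about the internal format of the file.
check_graph() {
    graph=$1
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$graph" ]; then
        echo "error: $graph is missing or empty" >&2
        return 1
    fi
    # Report the line count as a rough indication that output was produced.
    lines=$(wc -l < "$graph" | tr -d ' ')
    echo "$graph: $lines lines"
}
```

For example, running directory_to_URL_network.pl --directory chemical_src --output chemical.graph && check_graph chemical.graph reports the size of the result only if the utility exited successfully.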