Sie sind hier

Link Analysis on Document/Word Graphs to increase the Performance of Web Search

A standard search engine on Wikipedia could use the link graph of Wikipedia articles to calculate the PageRank of each article. After asking a query to this search engine one would retrieve all documents that contain the query words sorted by the PageRank of the returned articles. We will call this algorithm [3]. Usually also other metrics like TF-IDF are used and combined to this approach in order to increase the quality of the results. We will call this algorithm [4].

Recent work by Schenker [5] introduced the idea of word graphs. They basically consist of all words in a text collection and edges between words as soon as one word directly follows another. We want to try out the idea to not only construct a link graph of documents  but also create a word graph of all the words in the collection and then combine these two graphs. This can be done by creating an edge from every word to an article if and only if the article contains the word. The resulting graph can be used to calculate PageRank values again. (Algorithm [1] and [2].)

For this thesis the tasks is to download and parse Wikipedia (Source Code exists) in order to create the word and link graphs as well as the combined graph. The algorithms mentioned above have to be implemented and a user study has to be conducted to compare the quality of the results.


[1] Combined PageRank.
[2] Combined PageRank + TF-IDF.
[3] Standard PageRank.
[4] Standard PageRank + TF-IDF.
[5] Adam Schenker. Graph-theoretic techniques for web content mining, volume 62. World Scientic Publishing Company, 2005.