lucene-java-user mailing list archives

From David Spencer <>
Subject Re: Page ranking
Date Tue, 01 Jun 2004 17:24:01 GMT
Scott Sayles wrote:

>Is there anyone out there that has page ranking implemented on top of

I recently discovered JUNG which has 2 impls of PageRank:

I did a test of hooking it up to my spider and calculating pagerank of 
all pages in a javadoc tree (experimented with both and 

The basic procedure is:
[1] grab all pages to a local cache while building a table of page->page 
links
[2] using the page->page link data, calculate pageranks with JUNG and 
cache them
[3] go through the cache and index the pages (to a Lucene index), setting each 
document's boost (Document.setBoost()) to the pagerank value
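
As a rough illustration of step [2], here is a minimal sketch of PageRank by plain power iteration over a page->page link table. It is not JUNG's API and not the code I ran; the class name PageRankSketch, the method signature, and the 0.85 damping factor in the usage note are assumptions for illustration. Pages with no outgoing links have their rank spread evenly over all pages, which is one common convention.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: power-iteration PageRank over a link table (not JUNG). */
class PageRankSketch {

    /**
     * @param links   source URL -> URLs it links to
     * @param damping damping factor (0.85 is the conventional choice)
     * @param iters   number of iterations to run
     * @return URL -> pagerank; the values sum to 1.0
     */
    static Map<String, Double> pageRank(Map<String, List<String>> links,
                                        double damping, int iters) {
        // Collect every page that appears as a link source or a link target.
        Set<String> pages = new LinkedHashSet<>(links.keySet());
        for (List<String> targets : links.values()) {
            pages.addAll(targets);
        }
        int n = pages.size();
        Map<String, Double> rank = new HashMap<>();
        if (n == 0) {
            return rank;
        }
        // Start from a uniform distribution.
        for (String p : pages) {
            rank.put(p, 1.0 / n);
        }

        for (int i = 0; i < iters; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String p : pages) {
                next.put(p, 0.0);
            }
            double dangling = 0.0; // rank held by pages with no outgoing links
            for (String p : pages) {
                List<String> out = links.get(p);
                if (out == null || out.isEmpty()) {
                    dangling += rank.get(p);
                } else {
                    // Each outgoing link passes on an equal share of the rank.
                    double share = rank.get(p) / out.size();
                    for (String t : out) {
                        next.merge(t, share, Double::sum);
                    }
                }
            }
            // Apply damping and spread the dangling rank evenly over all pages.
            for (String p : pages) {
                double r = (1 - damping) / n
                        + damping * (next.get(p) + dangling / n);
                next.put(p, r);
            }
            rank = next;
        }
        return rank;
    }
}
```

Calling PageRankSketch.pageRank(links, 0.85, 100) on the table from step [1] would give the URL -> pagerank map to cache; in step [3] each value would then be passed to the document's boost.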

I've just got this going over the weekend. Prelim results are 
disappointing.  Pages like get a high 
pagerank as all kinds of pages link to it, though when I search javadoc 
I never want that page. This might still turn out better, however: I'm 
not doing any query expansion yet, and on the next pass I'll auto-boost for 
title matches.

I can make available a table of pageranks (URL,pagerank pairs) for these 
runs if people want.

>Just in case anyone may be thinking otherwise, when I say page ranking
>I'm not referring to the ranking of results from searches.  I'm talking
>about something similar to how google computes what page may be more
>relevant or important (often referred to as PageRank) which is affected
>in part by how many other pages reference that page.
>I've been through the examples listed here:
>which provides information from the original google paper about page
>ranking.  Running the examples is fairly easy, but the big question I
>have is how can I practically update such data?  

I think this is a batch operation: you have to precalculate it when 
indexing the entire collection.

>And is there any
>potential integration with Lucene? 

My thoughts are to use Document.setBoost(), or just to store the value 
in a plain field and use that to sort the results.
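
To make the "plain field" option concrete, here is a small sketch of sorting hits by a stored pagerank value once they come back from a search. It uses no Lucene classes: the Hit type, its fields, and sortByPageRank are hypothetical names standing in for whatever you read back from the index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper: order search hits by a stored pagerank value. */
class RankSortSketch {

    /** Stand-in for a search hit: URL, text relevance score, stored pagerank. */
    static final class Hit {
        final String url;
        final double score;
        final double pageRank;

        Hit(String url, double score, double pageRank) {
            this.url = url;
            this.score = score;
            this.pageRank = pageRank;
        }
    }

    /**
     * Returns a new list ordered by pagerank, highest first.
     * Hits with equal pagerank are ordered by text score, highest first.
     */
    static List<Hit> sortByPageRank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(
            Comparator.comparingDouble((Hit h) -> h.pageRank).reversed()
                .thenComparing(
                    Comparator.comparingDouble((Hit h) -> h.score).reversed()));
        return sorted;
    }
}
```

Sorting purely on pagerank ignores how well a page matches the query, so this is probably only useful as a tie-breaker or over a result set that has already been cut down by relevance.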

> It would seem that one could store
>the computed ranking values in the actual Lucene Document itself, but
>the updates 

Unless something has changed, indexes are "write-once". You really can't 
update an index other than by deleting a doc and re-adding it, and to calc 
pagerank you need all links between pages.

>would be fairly laborious as a few minor changes in rankings
>can produce a large ripple in other related document rankings.  This, of
>course, would be the same issue if the ranking information were stored
>outside of Lucene.  One could potentially store this in a separate
>database and then look up the ranking information for each document
>found and then perform updates as an external asynchronous task.
>Anyone have any experience with maintaining page rankings?

It might be of interest to see what Nutch does. It doesn't use pagerank, 
but it does seem to care about the number of incoming links. I think the key 
file is IndexSegment (see the source, not the javadoc).
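
For comparison, counting incoming links from the same kind of page->page table is much cheaper than pagerank. The sketch below is not Nutch's code, just a guess at the simplest version of the idea; InlinkCountSketch and inlinkCounts are hypothetical names, and ignoring self-links and duplicate links from the same page are my own assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: count distinct incoming links per page. */
class InlinkCountSketch {

    /**
     * @param links source URL -> URLs it links to
     * @return URL -> number of distinct other pages linking to it
     */
    static Map<String, Integer> inlinkCounts(Map<String, List<String>> links) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            String source = e.getKey();
            // Make sure pages with no incoming links still show up with 0.
            counts.putIfAbsent(source, 0);
            // De-duplicate so repeated links from one page count once.
            for (String target : new HashSet<>(e.getValue())) {
                if (!target.equals(source)) { // skip self-links
                    counts.merge(target, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```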

