lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: Page ranking
Date Tue, 01 Jun 2004 17:24:01 GMT
Scott Sayles wrote:

>Is there anyone out there that has page ranking implemented on top of
>Lucene?
>  
>

I recently discovered JUNG which has 2 impls of PageRank:

http://jung.sourceforge.net/api/1.4.1/edu/uci/ics/jung/algorithms/importance/PageRank.html

I did a test of hooking it up to my spider and calculating pagerank of 
all pages in a javadoc tree (experimented with both 
http://jakarta.apache.org/lucene/docs/api/overview-summary.html and 
http://java.sun.com/j2se/1.4.2/docs/api/overview-summary.html). 

The basic procedure is:
[1] grab all pages to a local cache while building a table of page->page 
links
[2] using the page->page link data, calculate pageranks with JUNG and 
cache this
[3] go through the cache and index the pages (to a Lucene index), setting each 
document's boost (Document.setBoost()) to the pagerank value
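
To make step [2] concrete, here is a minimal sketch of a pagerank calculation over the page->page link table. It is a plain power iteration written from scratch, not the JUNG implementation and not the code from my test. The class name PageRankSketch, the method rank(), and the damping/iteration parameters are all made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a from-scratch power iteration, standing in for JUNG.
public class PageRankSketch {

    /**
     * links maps each page URL to the list of URLs it links to.
     * Returns URL -> rank; the ranks sum to 1.
     */
    public static Map<String, Double> rank(Map<String, List<String>> links,
                                           double damping, int iterations) {
        // Collect every page that appears as a link source or a link target.
        Map<String, Double> ranks = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            ranks.put(e.getKey(), 0.0);
            for (String target : e.getValue()) {
                ranks.put(target, 0.0);
            }
        }
        final int n = ranks.size();
        if (n == 0) {
            return ranks;
        }
        // Start from a uniform distribution.
        ranks.replaceAll((page, r) -> 1.0 / n);

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new LinkedHashMap<>();
            for (String page : ranks.keySet()) {
                next.put(page, 0.0);
            }
            double danglingMass = 0.0; // rank held by pages with no out-links
            for (Map.Entry<String, Double> e : ranks.entrySet()) {
                List<String> out = links.get(e.getKey());
                if (out == null || out.isEmpty()) {
                    danglingMass += e.getValue();
                } else {
                    // Split this page's rank evenly across its out-links.
                    double share = e.getValue() / out.size();
                    for (String target : out) {
                        next.merge(target, share, Double::sum);
                    }
                }
            }
            // Random-jump term plus the dangling pages' rank spread over all pages.
            final double base = (1.0 - damping) / n + damping * danglingMass / n;
            next.replaceAll((page, r) -> base + damping * r);
            ranks = next;
        }
        return ranks;
    }
}
```

Calling PageRankSketch.rank(links, 0.85, 50) on the link table from step [1] gives the URL->pagerank map to cache. In step [3] each value would be what goes into Document.setBoost(). Since the values sum to 1 they get very small on a large crawl, so some scaling before boosting is probably needed (that part is a guess on my side).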

I've just got this going over the weekend. Prelim results are 
disappointing.  Pages like 
http://java.sun.com/j2se/1.4.2/docs/api/deprecated-list.html get a high 
pagerank as all kinds of pages link to it, though when I search javadoc 
I never want that page. This may still turn out better, though: I'm not 
doing any query expansion yet, and on the next pass I'll auto-boost 
title matches.

I can make available a table of pageranks (URL,pagerank pairs) for these 
runs if people want.

>Just in case anyone may be thinking otherwise, when I say page ranking
>I'm not referring to the ranking of results from searches.  I'm talking
>about something similar to how google computes what page may be more
>relevant or important (often referred to as PageRank) which is affected
>in part by how many other pages reference that page.
>
>I've been through the examples listed here:
>
>http://www.iprcom.com/papers/pagerank/index.html
>
>which provides information from the original google paper about page
>ranking.  Running the examples is fairly easy, but the big question I
>have is how can I practically update such data?  
>

I think this is a batch operation: you have to precalculate it when 
indexing the entire collection.

>And is there any
>potential integration with Lucene? 
>

My thought is either Document.setBoost(), or storing the value in a 
plain field and using it to sort the results.
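
As a rough illustration of the "sort by it" option, here is a sketch that reorders a list of hit URLs by a pagerank table. It deliberately does not touch the Lucene API: RankSortSketch and sortByRank() are hypothetical names, and the pageRanks map stands in for whatever the stored field (or an external table) would hold.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reorders search hits by a precomputed pagerank value.
public class RankSortSketch {

    /**
     * Returns the hit URLs ordered by descending pagerank.
     * URLs with no known pagerank are treated as 0.0 and sort last.
     * The sort is stable, so ties keep the original hit order.
     */
    public static List<String> sortByRank(List<String> hitUrls,
                                          Map<String, Double> pageRanks) {
        List<String> sorted = new ArrayList<>(hitUrls);
        sorted.sort(Comparator.comparingDouble(
                (String url) -> pageRanks.getOrDefault(url, 0.0)).reversed());
        return sorted;
    }
}
```

Sorting purely on pagerank throws away the relevance score, which is the main difference from the setBoost() route, where the pagerank only scales the score.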

> It would seem that one could store
>the computed ranking values in the actual Lucene Document itself, but
>the updates 
>

Unless something has changed, indexes are effectively "write-once". You 
really can't update an index other than by deleting a doc and re-adding 
it, and to calculate pagerank you need all the links between pages.

>would be fairly laborious as a few minor changes in rankings
>can produce a large ripple in other related document rankings.  This, of
>course, would be the same issue if the ranking information were stored
>outside of Lucene.  One could potentially store this in a separate
>database and then look up the ranking information for each document
>found and then perform updates as an external asynchronous task.
>
>Anyone have any experience with maintaining page rankings?
>  
>

It might be of interest to see what Nutch does. It doesn't use pagerank 
but it does seem to care about the number of incoming links. I think the 
key file is IndexSegment (see the source, not the javadoc).

>
>Thanks,
>
>Scott
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>  
>

