lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Lucene Internals question
Date Tue, 23 Jan 2007 17:07:56 GMT
You might also be interested in 
LUCENE-755 (aka the Payloads patch) which will enable storing  
information at the token level and allow for plugging in more scoring  
options related to it.

There has been a variety of discussions over on java-dev related to  
topics that would enable things like PageRank, etc. at a lower level  
in Lucene.  Search java-dev for "Flexible Indexing", "Payloads", etc.

You will also find additional Lucene scoring information at http://


On Jan 22, 2007, at 4:44 PM, Nicolas Lalevée wrote:

> Le Lundi 22 Janvier 2007 20:44, EDMOND KEMOKAI a écrit :
>> Hmm..doesn't lucene scoring determine how relevant a document is  
>> to your
>> query? That is what PageRank and HITS do as well, I believe. Page and
>> document are the same, if you want to index a page you'll  
>> obviously try to
>> convert it into a document. PageRank does link analysis to  
>> determine how
>> relevant that page is as it relates to the query you entered, does  
>> lucene
>> have something similar? How does lucene determine between two  
>> documents
>> which one should score higher if they both contain a certain term?  
>> Google
>> uses PageRank to make that determination, how does lucene do it?
> The thing is that Google can use both its PageRank and Lucene.
> The PageRank caculation doesn't depend on the query. It's value is  
> quite
> absolute. It depends of how the other pages refer to it.
> And the Lucene scoring is calculating how a document (document is  
> the generic
> Lucene term, beacause Lucene can be used for other things that web  
> search)
> match the query. Searching for "foo bar", a document with no "foo",  
> no "bar"
> in it it scored at 0, even more not scored at all. A document with  
> one "foo",
> but no "bar" will be scored to 0.5, with one "foo" and one "bar",  
> separated
> with 15 words, will be scored 0.8, and a document with "foo bar",  
> scored to
> 1.0.
> Then you can combine the both, as Google is probably doing. Same  
> exemple :
> - a page with no "foo", neither "bar", whatever is its page rank,  
> won't be
> scored.
> - a page with a "foo" and a "bar", with a low page rank, let say 0.5
> - a page with a "foo" and a "bar", with a high page rank, let say 0.9
> If you want to know more about Lucene scoring and Page rank, you  
> should go see
> Nutch and its mailing list :
> Nicolas
>> On 1/22/07, Nicolas Lalevée <> wrote:
>>> Le Lundi 22 Janvier 2007 19:33, EDMOND KEMOKAI a écrit:
>>>> Hi All
>>>> This is a question for those familiar with lucene document  
>>>> scoring. How
>>>> does it compare with googles PageRank or HITS, or are they very
>>> different?
>>>> I have being looking at the PageRank algorithm but I'll need to
>>> brush-off
>>>> my math skills before delving into it:)
>>> In fact Lucene is just a search engine. Then you can use the search
>>> engine to
>>> search in web pages, like Nutch is using Lucene. And Google is  
>>> more like
>>> Nutch : a web crawler plus a web-search engine. So when you are  
>>> taking
>>> about
>>> page raking, it has nothing to do with Lucene scoring. Lucene  
>>> scoring is
>>> how
>>> about the result entry match your query. Page raking is more  
>>> about how
>>> relevant is the web page. So for a document, the Lucene scoring  
>>> depends
>>> on the query, and the page raking is quite absolute.
>>> Nicolas
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail:
>>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message