lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: Lucene Internals question
Date Tue, 23 Jan 2007 17:07:56 GMT
You might also be interested in https://issues.apache.org/jira/browse/ 
LUCENE-755 (aka the Payloads patch) which will enable storing  
information at the token level and allow for plugging in more scoring  
options related to it.

There has been a variety of discussions over on java-dev related to  
topics that would enable things like PageRank, etc. at a lower level  
in Lucene.  Search java-dev for "Flexible Indexing", "Payloads", etc.

You will also find additional Lucene scoring information at http:// 
lucene.apache.org/java/docs/scoring.html

-Grant

On Jan 22, 2007, at 4:44 PM, Nicolas Lalevée wrote:

> Le Lundi 22 Janvier 2007 20:44, EDMOND KEMOKAI a écrit :
>> Hmm..doesn't lucene scoring determine how relevant a document is  
>> to your
>> query? That is what PageRank and HITS do as well, I believe. Page and
>> document are the same, if you want to index a page you'll  
>> obviously try to
>> convert it into a document. PageRank does link analysis to  
>> determine how
>> relevant that page is as it relates to the query you entered, does  
>> lucene
>> have something similar? How does lucene determine between two  
>> documents
>> which one should score higher if they both contain a certain term?  
>> Google
>> uses PageRank to make that determination, how does lucene do it?
>
> The thing is that Google can use both its PageRank and Lucene.
>
> The PageRank caculation doesn't depend on the query. It's value is  
> quite
> absolute. It depends of how the other pages refer to it.
>
> And the Lucene scoring is calculating how a document (document is  
> the generic
> Lucene term, beacause Lucene can be used for other things that web  
> search)
> match the query. Searching for "foo bar", a document with no "foo",  
> no "bar"
> in it it scored at 0, even more not scored at all. A document with  
> one "foo",
> but no "bar" will be scored to 0.5, with one "foo" and one "bar",  
> separated
> with 15 words, will be scored 0.8, and a document with "foo bar",  
> scored to
> 1.0.
>
> Then you can combine the both, as Google is probably doing. Same  
> exemple :
> - a page with no "foo", neither "bar", whatever is its page rank,  
> won't be
> scored.
> - a page with a "foo" and a "bar", with a low page rank, let say 0.5
> - a page with a "foo" and a "bar", with a high page rank, let say 0.9
>
> If you want to know more about Lucene scoring and Page rank, you  
> should go see
> Nutch and its mailing list : http://lucene.apache.org/nutch/
>
> Nicolas
>
>>
>> On 1/22/07, Nicolas Lalevée <nicolas.lalevee@anyware-tech.com> wrote:
>>> Le Lundi 22 Janvier 2007 19:33, EDMOND KEMOKAI a écrit:
>>>> Hi All
>>>> This is a question for those familiar with lucene document  
>>>> scoring. How
>>>> does it compare with googles PageRank or HITS, or are they very
>>>
>>> different?
>>>
>>>> I have being looking at the PageRank algorithm but I'll need to
>>>
>>> brush-off
>>>
>>>> my math skills before delving into it:)
>>>
>>> In fact Lucene is just a search engine. Then you can use the search
>>> engine to
>>> search in web pages, like Nutch is using Lucene. And Google is  
>>> more like
>>> Nutch : a web crawler plus a web-search engine. So when you are  
>>> taking
>>> about
>>> page raking, it has nothing to do with Lucene scoring. Lucene  
>>> scoring is
>>> how
>>> about the result entry match your query. Page raking is more  
>>> about how
>>> relevant is the web page. So for a document, the Lucene scoring  
>>> depends
>>> on the query, and the page raking is quite absolute.
>>>
>>> Nicolas
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://www.paperoftheweek.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message