lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicolas Lalevée <nicolas.lale...@anyware-tech.com>
Subject Re: Lucene Internals question
Date Mon, 22 Jan 2007 21:44:43 GMT
Le Lundi 22 Janvier 2007 20:44, EDMOND KEMOKAI a écrit :
> Hmm..doesn't lucene scoring determine how relevant a document is to your
> query? That is what PageRank and HITS do as well, I believe. Page and
> document are the same, if you want to index a page you'll obviously try to
> convert it into a document. PageRank does link analysis to determine how
> relevant that page is as it relates to the query you entered, does lucene
> have something similar? How does lucene determine between two documents
> which one should score higher if they both contain a certain term? Google
> uses PageRank to make that determination, how does lucene do it?

The thing is that Google can use both its PageRank and Lucene.

The PageRank caculation doesn't depend on the query. It's value is quite 
absolute. It depends of how the other pages refer to it.

And the Lucene scoring is calculating how a document (document is the generic 
Lucene term, beacause Lucene can be used for other things that web search) 
match the query. Searching for "foo bar", a document with no "foo", no "bar" 
in it it scored at 0, even more not scored at all. A document with one "foo", 
but no "bar" will be scored to 0.5, with one "foo" and one "bar", separated 
with 15 words, will be scored 0.8, and a document with "foo bar", scored to 
1.0.

Then you can combine the both, as Google is probably doing. Same exemple :
- a page with no "foo", neither "bar", whatever is its page rank, won't be 
scored.
- a page with a "foo" and a "bar", with a low page rank, let say 0.5
- a page with a "foo" and a "bar", with a high page rank, let say 0.9

If you want to know more about Lucene scoring and Page rank, you should go see 
Nutch and its mailing list : http://lucene.apache.org/nutch/

Nicolas

>
> On 1/22/07, Nicolas Lalevée <nicolas.lalevee@anyware-tech.com> wrote:
> > Le Lundi 22 Janvier 2007 19:33, EDMOND KEMOKAI a écrit:
> > > Hi All
> > > This is a question for those familiar with lucene document scoring. How
> > > does it compare with googles PageRank or HITS, or are they very
> >
> > different?
> >
> > > I have being looking at the PageRank algorithm but I'll need to
> >
> > brush-off
> >
> > > my math skills before delving into it:)
> >
> > In fact Lucene is just a search engine. Then you can use the search
> > engine to
> > search in web pages, like Nutch is using Lucene. And Google is more like
> > Nutch : a web crawler plus a web-search engine. So when you are taking
> > about
> > page raking, it has nothing to do with Lucene scoring. Lucene scoring is
> > how
> > about the result entry match your query. Page raking is more about how
> > relevant is the web page. So for a document, the Lucene scoring depends
> > on the query, and the page raking is quite absolute.
> >
> > Nicolas
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message