lucene-java-user mailing list archives

From Doug Cutting <DCutt...@grandcentral.com>
Subject RE: Googlifying lucene querys
Date Mon, 25 Feb 2002 18:58:40 GMT
> From: Joshua O'Madadhain [mailto:jmadden@ics.uci.edu]
> 
> You cannot, in general, structure a Lucene query such that it will yield
> the same document rankings that Google would for that (query, document
> set).  The reason for this is that Google employs a scoring algorithm that
> includes information about the topology of the pages (i.e., how the
> pages are linked together).  (An overview of what Google does in this
> regard may be found at http://www.google.com/technology/index.html .)
> Thus, in order to get Lucene to do "what Google does", you'd have to
> rewrite large chunks of it.

I don't agree with your conclusion: you would not have to re-write much of
Lucene to incorporate this sort of information.  To my understanding, Google
uses linking information as a factor in scoring.  Thus every document in the
index has a factor computed from its links that is multiplied into its
score.
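
As a rough illustration of a link factor multiplied into the score, here is a
minimal sketch.  It does not use any Lucene classes: LinkBoostRanking, score,
and rank are hypothetical names, and the text scores and link factors are
assumed to have been computed elsewhere.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene code: ranks documents by
// (text-match score) * (per-document link factor).
final class LinkBoostRanking {

    // The link factor is simply multiplied into the text score.
    static double score(double textScore, double linkFactor) {
        return textScore * linkFactor;
    }

    // Returns document ids ordered from highest to lowest combined score.
    // A document with no link factor is treated as having a factor of 1.0.
    static List<String> rank(Map<String, Double> textScores,
                             Map<String, Double> linkFactors) {
        List<String> ids = new ArrayList<>(textScores.keySet());
        ids.sort(Comparator.comparingDouble(
                (String id) -> score(textScores.get(id),
                                     linkFactors.getOrDefault(id, 1.0)))
                .reversed());
        return ids;
    }
}
```

With text scores a=0.9 and b=0.6, a link factor of 2.0 on b moves it ahead of
a (1.2 versus 0.9), while the text scores alone would rank a first.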

Lucene already keeps a factor per document that is multiplied into its
score, but one that is computed from the document's length, not its links.
Thus, once one has computed link scores, to add them to Lucene we just need
to permit applications to affect this factor, with something like a
Document.setBoost(float) method.  The representation of the per-document
factor would also need to change a little internally.  It is currently
stored as a single byte, and multiplying in an arbitrary factor would cause
overflow.  But enlarging it to 16 bits would be a small change.
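
To make the overflow concern concrete, here is a minimal sketch.  It is not
Lucene's actual encoding: NormSketch, lengthNorm, encode, and decode are
hypothetical names, the 1/sqrt(numTerms) length factor is an assumption, and
the linear quantization is chosen only for illustration.  It assumes the
8-bit form covers factors in [0, 1] and the 16-bit form covers [0, 256].

```java
// Hypothetical sketch, not Lucene code: a per-document factor stored in a
// fixed number of bits, with and without room for an application boost.
final class NormSketch {

    // Assumed length-based factor: longer documents get a smaller factor.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Linearly quantize a factor in [0, maxValue] into `bits` bits.
    // Values above maxValue saturate at the largest code, which is the
    // "overflow" case: the extra boost is silently lost.
    static int encode(float factor, float maxValue, int bits) {
        int levels = (1 << bits) - 1;
        float clamped = Math.max(0f, Math.min(factor, maxValue));
        return Math.round(clamped / maxValue * levels);
    }

    // Recover the approximate factor from its code.
    static float decode(int code, float maxValue, int bits) {
        int levels = (1 << bits) - 1;
        return code * maxValue / levels;
    }
}
```

For a four-term document, lengthNorm gives 0.5; multiplying in a link boost
of 5 gives 2.5.  Under these assumptions, encode(2.5f, 1f, 8) saturates at
code 255 and decodes back to 1.0, so the boost is lost, whereas
encode(2.5f, 256f, 16) keeps the value to within about 0.004, the same step
size the 8-bit form has over [0, 1].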

So adding such a capability would require re-writing only a very small chunk
of Lucene.  Computing a link-based factor would also take some code, but
that's writing, not re-writing.

Doug
