lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Rich positions (was "boosting fields")
Date Thu, 27 Apr 2006 18:58:28 GMT

On Apr 27, 2006, at 9:41 AM, Doug Cutting wrote:

> karl wettin wrote:
>> My own immediate thought is to compromise by allowing boost per
>> term in document. Simply remove the norms-methods from the
>> IndexReader and add a new one to the TermEnum and fall back on
>> the field boost. How would the value be picked up by the scorer?
>> Boost per position, etc. sounds very expensive.
>
> Indeed.  It will probably nearly double the size of indexes and  
> also increase search time.

I have been considering making a similar change to the KinoSearch  
file format.  Not having to cache norms radically cuts down on the  
time required to launch a fresh Searcher, especially if there aren't  
any deleted docs.  That's a win if you're launching a search app from  
scratch, like if you're running a web search under CGI rather than  
mod_perl.  It's also a win for refreshing a Searcher against a  
frequently updated index.

What I was considering was interleaving the document's
score-multiplier norm byte between the VInts in the .frq file.  That
would mean more disk I/O when a term's postings take up more than a
block on the file system, but at least the info would be contiguous.
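
A minimal sketch of that interleaving, under simplifying assumptions: each posting is written as a doc-delta VInt, a freq VInt, and then a single norm byte. The function names and the exact record layout are hypothetical, and this ignores any other tricks a real .frq encoding might use.

```python
import io

def write_vint(out, value):
    """Write a non-negative int, 7 bits per byte, low-order bits first."""
    while value > 0x7F:
        out.write(bytes([(value & 0x7F) | 0x80]))  # high bit set: more bytes follow
        value >>= 7
    out.write(bytes([value]))

def read_vint(inp):
    """Read one variable-length int written by write_vint."""
    shift = 0
    result = 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7

def write_postings(out, postings):
    """postings: (doc_id, freq, norm_byte) tuples sorted by doc_id.

    Assumed layout per doc: VInt doc delta, VInt freq, one raw norm byte.
    """
    last_doc = 0
    for doc_id, freq, norm in postings:
        write_vint(out, doc_id - last_doc)
        write_vint(out, freq)
        out.write(bytes([norm]))  # the interleaved score-multiplier byte
        last_doc = doc_id

def read_postings(inp, doc_count):
    """Decode doc_count postings; the norm arrives with the doc, no cache needed."""
    postings = []
    doc_id = 0
    for _ in range(doc_count):
        doc_id += read_vint(inp)
        freq = read_vint(inp)
        norm = inp.read(1)[0]
        postings.append((doc_id, freq, norm))
    return postings
```

The cost is one extra byte per document per term; the gain is that a scorer reading postings sequentially never has to consult a separately loaded norms array.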

I hadn't considered interleaving the score-multiplier into .prx, but  
that opens many possibilities.  Boost positions that appear near the  
top of the doc.  Boost positions if they occur within certain HTML  
tags.  Good stuff!
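
To make the idea concrete, here is a rough sketch of a per-position multiplier. Everything in it is an assumption for illustration: the tag table, the 50-token "top of the doc" window, and the boost values are made up, and a real format would have to quantize the result into whatever is stored alongside each position in .prx.

```python
# Hypothetical per-tag multipliers; not taken from any existing scorer.
TAG_BOOSTS = {"title": 3.0, "h1": 2.0, "b": 1.25}

def position_boost(position, tag=None, top_window=50, top_boost=1.5):
    """Multiplier for one occurrence of a term.

    Occurrences within the first top_window tokens get top_boost;
    occurrences inside a known tag get that tag's multiplier as well.
    """
    boost = top_boost if position < top_window else 1.0
    return boost * TAG_BOOSTS.get(tag, 1.0)

def weighted_freq(occurrences):
    """Sum of per-position boosts, usable in place of a raw term frequency."""
    return sum(position_boost(pos, tag) for pos, tag in occurrences)
```

For example, `weighted_freq([(3, "title"), (100, None)])` counts the early title hit far more heavily than the plain occurrence deep in the body.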

Moving away from cached norms was the second of three major changes  
to the file format on my agenda, and the one I was all but certain I  
wouldn't be able to sell to the Lucene community.  The first was  
using bytecounts at the head of Strings.
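
One way to read "bytecounts at the head of Strings" is a length prefix that counts encoded bytes rather than characters. The sketch below assumes UTF-8 and a VInt prefix; the helper names are hypothetical. With a byte count, a reader can skip or bulk-read a string without decoding it.

```python
import io

def write_vint(out, value):
    """Write a non-negative int, 7 bits per byte, low-order bits first."""
    while value > 0x7F:
        out.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    out.write(bytes([value]))

def read_vint(inp):
    """Read one variable-length int written by write_vint."""
    shift = 0
    result = 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7

def write_string(out, text):
    """Prefix the string with its length in bytes, then the UTF-8 bytes."""
    data = text.encode("utf-8")
    write_vint(out, len(data))
    out.write(data)

def read_string(inp):
    """Read the byte count, then decode exactly that many bytes."""
    return inp.read(read_vint(inp)).decode("utf-8")

def skip_string(inp):
    """Jump over a string without decoding it; only possible with a byte count."""
    inp.seek(read_vint(inp), io.SEEK_CUR)
```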

The third was storing start offsets and end offsets in the ProxFile.   
It rankles that much of the information from tis/frq/prx gets  
duplicated in the term vector files, but highlighting is most  
efficient when you know the offsets, and the primary index stops  
short of storing that information.  Currently, we have this:

     ProxFile (.prx) -->  <TermPositions>TermCount

How about this?

     ProxFile (.prx) -->  <TermPositions,TermOffsets>TermCount

To get highlighting info now, you retrieve a document's term vector  
information and then extract the offsets for the specific term.  This  
format reverses the order: first you find the term, then you extract  
the offsets info for a particular doc.
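
A sketch of what one term's entry for one doc might look like under the proposed format, and how a highlighter could consume it. The layout is an assumption: each occurrence is stored as a position delta, a start-offset delta, and a length (end minus start), all as VInts. The function names and the `<b>` markers are hypothetical, and the occurrence count is assumed to come from the freq in .frq.

```python
import io

def write_vint(out, value):
    """Write a non-negative int, 7 bits per byte, low-order bits first."""
    while value > 0x7F:
        out.write(bytes([(value & 0x7F) | 0x80]))
        value >>= 7
    out.write(bytes([value]))

def read_vint(inp):
    """Read one variable-length int written by write_vint."""
    shift = 0
    result = 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7

def write_term_positions(out, entries):
    """entries: (position, start_offset, end_offset) for one term in one doc, sorted."""
    last_pos = 0
    last_start = 0
    for position, start, end in entries:
        write_vint(out, position - last_pos)   # position delta
        write_vint(out, start - last_start)    # start offset delta
        write_vint(out, end - start)           # token length
        last_pos, last_start = position, start

def read_term_positions(inp, freq):
    """Decode freq occurrences back into (position, start, end) tuples."""
    entries = []
    position = 0
    start = 0
    for _ in range(freq):
        position += read_vint(inp)
        start += read_vint(inp)
        end = start + read_vint(inp)
        entries.append((position, start, end))
    return entries

def highlight(text, entries, pre="<b>", post="</b>"):
    """Wrap each occurrence in markers using the stored offsets alone."""
    parts = []
    cursor = 0
    for _, start, end in entries:
        parts.append(text[cursor:start])
        parts.append(pre + text[start:end] + post)
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)
```

With offsets available straight from the postings, the highlighter needs only the stored text and the decoded entries for the matched term; no term vector lookup is involved.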

I haven't implemented this change yet, so I'm not sure how it works  
out.  The current version of KinoSearch stores term vectors in  
the .fdt file, which is a win for locality of reference.  It sure  
would be nice to eliminate all that duplicated data, though.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

