lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Rich positions (was "boosting fields")
Date Sat, 29 Apr 2006 17:48:33 GMT

On Apr 29, 2006, at 12:40 AM, Marvin Humphrey wrote:
> One file, the "PostingsFile", which merges the FreqFile, ProxFile,  
> and Boost/Norm for each posting into a single contiguous block,  
> with an eye towards aggressively minimizing disk seeks.

Interleaving the positions with the freqs is inefficient for a
simple term query, provided that a score multiplier is available for
each document and does not have to be built up posting by posting:
the scorer has to scan past position data it never uses.  However,
simple term queries typically do not stress the system, and even if
the cost of scanning through positions is significant, at least no
additional disk seeks are required.
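
To make the term-query case concrete, here is a minimal sketch of an
interleaved posting stream.  The layout (doc delta, freq, then freq
position deltas, all as plain integers) and the names encode_postings
and scan_term_query are assumptions made for illustration; this is not
Lucene's actual file format, and the Boost/Norm component is left out.

```python
from typing import Iterator, List, Tuple


def encode_postings(postings: List[Tuple[int, List[int]]]) -> List[int]:
    """Flatten (doc_id, positions) pairs into one interleaved stream.

    Hypothetical layout per document: doc_delta, freq, then freq position deltas.
    """
    stream: List[int] = []
    last_doc = 0
    for doc_id, positions in postings:
        stream.append(doc_id - last_doc)
        stream.append(len(positions))
        last_pos = 0
        for pos in positions:
            stream.append(pos - last_pos)
            last_pos = pos
        last_doc = doc_id
    return stream


def scan_term_query(stream: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield (doc_id, freq), stepping over positions a term query never uses."""
    i = 0
    doc_id = 0
    while i < len(stream):
        doc_id += stream[i]
        freq = stream[i + 1]
        yield doc_id, freq
        # Skip the position deltas: wasted sequential reading, but no seek.
        i += 2 + freq


stream = encode_postings([(3, [1, 7]), (8, [4]), (20, [2, 5, 9])])
hits = list(scan_term_query(stream))
```

The skip in scan_term_query is the cost in question: it grows with the
number of positions, but it is a sequential read within one block.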

Phrase queries should, in theory, benefit from interleaving
positional data with frequency data.  At present, fetching freq data
and prox data generally requires at least two disk seeks per term; if
the two are interleaved, the number of seeks is roughly cut in half.
It's unlikely that all the freq and prox data for a common term in a
large index will be fetched in a single read, but it seems likely
that interleaving will still be an advantage even then.
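
As a rough illustration, a phrase matcher over interleaved streams
needs only one stream per term, so an n-term phrase costs at least n
seeks rather than 2n.  The sketch below assumes a hypothetical layout
(doc delta, freq, then position deltas, as plain integers); the names
decode and phrase_docs are illustrative, and everything is held in
memory for simplicity rather than read incrementally from disk.

```python
from typing import Dict, Iterator, List, Tuple


def decode(stream: List[int]) -> Iterator[Tuple[int, List[int]]]:
    """Yield (doc_id, positions) from one interleaved stream."""
    i = 0
    doc_id = 0
    while i < len(stream):
        doc_id += stream[i]
        freq = stream[i + 1]
        pos = 0
        positions = []
        for delta in stream[i + 2 : i + 2 + freq]:
            pos += delta
            positions.append(pos)
        yield doc_id, positions
        i += 2 + freq


def phrase_docs(streams: List[List[int]]) -> List[int]:
    """Return doc ids where the terms occur at consecutive positions, in order."""
    per_term: List[Dict[int, List[int]]] = [dict(decode(s)) for s in streams]
    common = set(per_term[0]).intersection(*per_term[1:])
    matches = []
    for doc_id in sorted(common):
        # Candidate start positions of the phrase, narrowed term by term.
        starts = set(per_term[0][doc_id])
        for offset, postings in enumerate(per_term[1:], start=1):
            starts &= {p - offset for p in postings[doc_id]}
        if starts:
            matches.append(doc_id)
    return matches


# First term: doc 1 at [2], doc 4 at [0, 5].  Second term: doc 1 at [9], doc 4 at [6].
first = [1, 1, 2, 3, 2, 0, 5]
second = [1, 1, 9, 3, 1, 6]
result = phrase_docs([first, second])
```

Here only doc 4 matches, since the second term follows the first at
positions 5 and 6.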

If boolean queries do not use positional information except when
one of their sub-queries is a phrase query, then having positions
interleaved is a loss.  However, Brin/Page 1998 proposes using
positional data to improve precision, by scoring documents higher
when any two query terms occur near each other, even though the user
may not have grouped them together.  A particularly sophisticated
variant might also take into account the order of the words in the
query.
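
A proximity bonus of that kind could be sketched as follows.  The
formula (1 + 1/distance inside a window), the window size, and the
names min_distance and proximity_boost are all assumptions chosen for
illustration; they are not taken from Brin/Page, and word order is
ignored.

```python
from typing import List, Optional


def min_distance(a: List[int], b: List[int]) -> Optional[int]:
    """Smallest gap between any position in a and any in b (both sorted)."""
    i = j = 0
    best: Optional[int] = None
    while i < len(a) and j < len(b):
        d = abs(a[i] - b[j])
        if best is None or d < best:
            best = d
        # Advance whichever list is behind.
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best


def proximity_boost(a: List[int], b: List[int], window: int = 10) -> float:
    """Hypothetical multiplier: 1.0 if the terms are far apart, more if close."""
    d = min_distance(a, b)
    if d is None or d == 0 or d > window:
        return 1.0
    return 1.0 + 1.0 / d


near = proximity_boost([3, 40], [10, 42])
far = proximity_boost([1], [100])
```

Every term pair in a boolean query would need its positions to compute
such a bonus, which is what makes the interleaved layout attractive.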

If we establish the constraint that boolean queries must exploit
positional data, then merging the FreqFile and the ProxFile is a
clear win.

Marvin Humphrey
Rectangular Research
