lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Flexible index format / Payloads Cont'd
Date Fri, 30 Jun 2006 14:48:04 GMT

On Jun 30, 2006, at 6:07 AM, Nadav Har'El wrote:

> On Thu, Jun 29, 2006, Marvin Humphrey wrote about "Re: Flexible index
> format / Payloads Cont'd":
>>   * Improve IR precision, by writing a Boolean Scorer that
>>     takes position into account, a la Brin/Page '98.
>
> Yes, I'd love to see that too (and it doesn't even require any new
> payloads support, the positions that Lucene already has are enough).

True.  Any intrepid volunteers jonesing to hack on BooleanScorer2?   
Yeeha!

The reason I included this in my summary rather than separating it  
out into something we could do earlier was locality of reference.

Right now, the boolean scorers scan through freqs for all terms, but  
positions for only some terms.  For common terms, which is where the  
bulk of the cost lies in scoring, scanning through both freqs and  
positions involves a number of disk seeks, as .frq and .prx are  
consumed in 1k chunks.  This is an area where OS caching is unlikely  
to help too much, as we're talking about a lot of data.

A boolean scorer requiring that positions be read for *all* terms  
will cost more.  However, by merging the freq and prox files, those  
disk seeks are eliminated, as all the freq/prox data for a term can  
be slurped up in one contiguous read.  That may serve to mitigate the  
costs some.

However, simple term queries, at least those against fields where  
positions are stored, will cost more -- because it will be necessary  
to scan past irrelevant positional data.  I think people who do a lot  
of yes/no, unscored matches might be unhappy about that.
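
To make the trade-off concrete, here is a minimal sketch of a merged freq/prox stream. The layout (doc id, freq, then that many positions, all as fixed-width ints) and the class and method names are assumptions chosen for illustration; this is not Lucene's actual file format, which uses separate .frq and .prx files. The sketch shows that one contiguous stream holds everything a position-aware scorer needs, and that a doc-only scan over the same stream has to step over the position bytes.

```java
import java.nio.ByteBuffer;

// Hypothetical layout for illustration only; not Lucene's on-disk format.
class MergedPostings {
    // Per document: docId (4 bytes), freq (4 bytes), then freq positions (4 bytes each).
    static byte[] write(int[] docIds, int[][] positions) {
        int size = 0;
        for (int[] p : positions) {
            size += 8 + 4 * p.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (int i = 0; i < docIds.length; i++) {
            buf.putInt(docIds[i]);
            buf.putInt(positions[i].length); // freq
            for (int p : positions[i]) {
                buf.putInt(p);
            }
        }
        return buf.array();
    }

    // Doc-only scan: reads docId and freq, steps over the positions,
    // and reports how many position bytes it had to skip.
    static int skippedBytesInDocOnlyScan(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        int skipped = 0;
        while (in.hasRemaining()) {
            in.getInt();                 // doc id (unused in this sketch)
            int freq = in.getInt();
            int positionBytes = 4 * freq;
            in.position(in.position() + positionBytes);
            skipped += positionBytes;
        }
        return skipped;
    }
}
```

Calling `MergedPostings.write(new int[]{1, 5}, new int[][]{{3, 9}, {2}})` yields one buffer for both documents, and `skippedBytesInDocOnlyScan` on that buffer counts the bytes a simple term query would pass over without using them.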

Generally, I'm concerned about anyone who has fine-tuned their system  
for search-time throughput.  Adding additional search-time costs may  
push some of these systems over the edge.  As a total package, I  
think the power of the changes easily justifies the price, and  
furthermore, IR precision cannot be bought with more hardware, while  
throughput can.  But I suspect there will be some interested parties  
who will disagree, and I'm sympathetic -- it would be a real bummer  
if costly "improvements" to BooleanScorer2 made your app unworkable.

BooleanScorer3 anyone?  Oi.

> I tried a small test using the TREC 8 corpus and query-relevance
> judgements, and saw a noticeable improvement in precision when I added
> a simplistic version of this feature: I "or"ed the original query words
> with SpanNearQuery's of each pair of words in the query, so the query
> "hot dog bun" will be converted to something similar to:
>
> 	hot OR dog OR bun OR "hot dog"~7^0.25 "dog bun"~7^0.25 "hot bun"~7^0.25

Nifty example!
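
For anyone who wants to try the same expansion, a rough sketch follows. It only builds a query string in the style quoted above; the class name, method name, and parameters are hypothetical, the slop of 7 and boost of 0.25 are taken from the quoted example, and nothing here is claimed to match how the original test constructed its SpanNearQuery objects. The pairs also come out in a different order than in the quoted string, and every clause is joined with an explicit OR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: expands a list of query terms
// into the terms themselves plus one boosted proximity clause per pair.
class ProximityExpansion {
    static String expand(String[] terms, int slop, double boost) {
        List<String> clauses = new ArrayList<>();
        // Original terms, unmodified.
        for (String term : terms) {
            clauses.add(term);
        }
        // One sloppy, down-weighted phrase clause for each unordered pair.
        for (int i = 0; i < terms.length; i++) {
            for (int j = i + 1; j < terms.length; j++) {
                clauses.add("\"" + terms[i] + " " + terms[j] + "\"~" + slop + "^" + boost);
            }
        }
        return String.join(" OR ", clauses);
    }
}
```

`ProximityExpansion.expand(new String[]{"hot", "dog", "bun"}, 7, 0.25)` produces three single-term clauses followed by three pair clauses. The number of pair clauses grows quadratically with query length, so long queries may need a cap.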

One more note: Though payloads are not necessary for exploiting  
positional data, associating a boost with each position opens the  
door to an additional improvement in IR precision.  The Googs, for  
instance, describe dedicating 4-8 bits per posting to text size, so  
that e.g. text between <h1> tags gets weighted more heavily than text  
between <p> tags.
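
As a sketch of what a per-position boost could look like, the snippet below packs a position and a small boost value into a single int. The 4-bit width is the low end of the range mentioned above; the class name, the packing order, and the idea of storing both in one int are assumptions for illustration, not a proposed Lucene payload format.

```java
// Hypothetical encoding for illustration: low 4 bits hold a boost level,
// the remaining high bits hold the term position.
class BoostedPosition {
    static final int BOOST_BITS = 4;
    static final int BOOST_MASK = (1 << BOOST_BITS) - 1; // 0..15

    static int pack(int position, int boost) {
        if (boost < 0 || boost > BOOST_MASK) {
            throw new IllegalArgumentException("boost out of range: " + boost);
        }
        if (position < 0 || position > (Integer.MAX_VALUE >>> BOOST_BITS)) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        return (position << BOOST_BITS) | boost;
    }

    static int position(int packed) {
        return packed >>> BOOST_BITS;
    }

    static int boost(int packed) {
        return packed & BOOST_MASK;
    }
}
```

An indexer could, for example, assign a higher boost level to positions inside <h1> than inside <p>, and a position-aware scorer would read both values back with `position()` and `boost()`. How boost levels map to score weights is left open here.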

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



