lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: lucene algorithm ?
Date Thu, 26 Apr 2012 14:21:58 GMT
On 26/04/2012 09:49, Uwe Schindler wrote:

> There are possibilities to truncate those lists, but this is not implemented
> in Lucene core. The main problem with Lucene's segmented index structure is,
> that you cannot early exit, because the very last document in the posting
> list could be one with a very high score. Techniques like index pruning or
> index sorting can optimize all this, but need index preprocessing and loose
> updatability of the index while in use.
>
> The Top-N result collection is optimized in Lucene (item #3), to throw away
> all collected documents, where the score is never be able to get into the
> top-n priority queue (too small score, once the queue is full). But the
> score has to be calculated.

This paper presents yet another option: keep a value that represents 
max. impact of postings in a block and then skip whole blocks if the max 
impact is too low to get docs within the block in the queue:

http://cis.poly.edu/suel/papers/bmw.pdf

Implementing this in Lucene would require changing the iterators so that 
they can take a threshold value, i.e. the current lowest score in the 
queue, so that nextDoc() and advance() could skip whole blocks when 
their max impact is lower than the current lowest score.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message