lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karl Wright <daddy...@yahoo.com>
Subject Re: Submission, btree BooleanScorer
Date Sun, 22 May 2005 11:12:01 GMT
The BooleanScorer2 code in the svn trunk looks like it has been entirely rewritten over the
code that this submission is based upon (which was the 1.4.3 stuff).  The techniques used
may still be applicable, but would apply within one or more of the specialized scorers instead:
DisjunctionScorer, ReqOptScorer, etc.
 
Karl


Paul Elschot <paul.elschot@xs4all.nl> wrote:
On Sunday 22 May 2005 03:09, Karl Wright wrote:
> I've been looking at the BooleanScorer code in 1.4.3 and realized that it 
has several problems. These are:
> 
> 1) It does things in chunks of 1024 document ids. This means it executes in 
a time that depends on the number of indexed documents.
> 2) Finding the subscorer with the lowest document id scales linearly with 
the number of scorers (corresponding to clauses in the Boolean query)
> 3) It does not implement the skipTo() method, because its technique of 
doiing 1024 document id's at a time interferes with this. This makes it 
impossible to use a BooleanScorer within a Conjunction Scorer.
> 
> I've attached a rewritten BooleanScorer which solves these problems. It 
basically uses a btree to keep the individual subscorers, and it removes 
subscorers that have reached the end of their documents. It thus removes the 
dependency on the number of documents indexed, and it performs in 
O(log(number of clauses)) instead of O(number of clauses).

Great. For disjunctions the BooleanScorer in the svn trunk implements
skipTo and it uses a priority queue (a heap) so the base of the log is 2.
Is the btree faster than that?
I had a short look at the code of addToTree, and the btree seems to be
unbalanced, so it could theoretically degrade to linear performance,
but in practice it could be good.
Did you try it with a PrefixQuery or a RangeQuery that expands to a lot
of terms?

One way to find out about performance of various boolean scorers
is to start from the TestDisjunctionPerf1.java here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34193
This might involve some class renaming in order to be able to
run under the same class loader.

Working in chunks as the 1.4.3 scorer does still has the best
performance afaik.

The svn trunk also has a test case TestBoolean2 in the test
package org.apache.lucene.search.
I hope the btree version passes this, it caught a few initial mistakes
in the current BooleanScorer2.
Still, I think there is a shortage of test cases for the boolean query and
scorers.

The version posted here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=34154
splits off the coordination computations in case they are not
needed, which might affect performance also.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 
Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message