lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mikhail Khludnev <mkhlud...@griddynamics.com>
Subject minShouldMatch can't leapfrog
Date Tue, 13 Nov 2012 19:59:21 GMT
Developers,

I want to discuss few points regarding disjunction form of BooleanQuery
with minShouldMatch constraint. I'm talking about doc-at-time evaluation
only (BooleanScorer2).
Look into conjunction query which has disjunction in one of its' clauses
e.g. +foo +(bar baz …). if disjunction "(bar baz … )" has high
minShouldMatch constraint even if conjunction clause +foo is highly
selective this query performs quite bad. It also happens if instead of +foo
you have a filter. Once again: it's reasonable that disjunction even with
high minShouldMatch is expensive, if "core" disjunction with
minShouldMatch=1 matches few millions of docs. The problem is that I can't
speed it up by supplying highly selective filter.
>From my POV there two points in Lucene API which make leapfrog impossible:
- advance() is obliged to return next matching doc that causes scan in
nextDoc() loop. It's great to have something like advanceExact(), or return
some magic value from advance says - "failed to advance, propose next for
leapfrog";
- Scorer is obliged to jump on the first matching doc after it's created
that leads to scan many docs in nextDoc() loop;
- ConjunctiveScorer can't know which of its legs are not able to leapfrog
and prefer to decline advance()
- Stefan spotted one more gain for minShouldMatch in DisjunctionSumScorer:
We don't need to walk through top of heap after top scorer is advanced.
Instead of this, let's pop minShouldMatch scores from the heap, look at the
top after this - top doc can be evaluated as potential match which exceeds
minShouldMatch constraint. After that, let's push those guys back to the
heap advancing them to that candidate docnum. It's not cohesioned with
leapfrog problem, though.

This last one looks impressing, but I'm not smart enough to realize that it
gives performance gain. Do you think it's valuable optimization for Lucene
users? How minShouldMatch is popular,btw? To be honest I'm not really
suffer form minShouldMatch itself, I have query with my own match
verification logic and therefore lack of leapfrog bothers me much.

Looking forward for you feedback!
-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mkhludnev@griddynamics.com>

Mime
View raw message