incubator-lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <>
Subject Re: Matcher
Date Fri, 13 Mar 2009 00:47:22 GMT
On Fri, Feb 06, 2009 at 05:55:39AM -0500, Michael McCandless wrote:

> One thing I'm struggling with now (in Lucene), that I'm not sure applies to
> KS/Lucy, is how to best take "random access Matchers" into account.  Not all
> Matchers are created equal.  Some really want to do the next()/skipTo()
> iteration (eg a posting list), but others (deleted docs currently, or cached
> filters, in Lucene) are "best" accessed by random access accept(int docID)
> API, 

---->8 SNIP 8<----

> I'm seeing sizable performance gains by "taking advantage" of Matchers that
> could expose a random access API.  It would be good if we could query a
> Matcher to see if it supports random access...

I hadn't replied to this before now for the same reason I hadn't posted a
response to the JIRA issue you opened: I'm not fond of the proposed solution,
but I don't have any better ideas. :(

Reaching into a Matcher and choosing which path to go down depending on
whether it supports random access or not is going to result in a lot of
duplicated code.  Everywhere we want to optimize, we need one track for the
random access method, and a fallback track for the iterator.  Furthermore, the
tracks aren't very similar, so the complexity cost to realize the optimization
is high.  

There's a classic OO code simplification technique you may be familiar with:
"replace conditionals with polymorphism".  This is the exact opposite of that:
"replace polymorphism with conditionals".  Ick.

At present, I don't have the necessary benchmarking tools at my disposal to
verify that your benchmarking tests also hold true for C.  Unfortunately, I
suspect they do; results might even be worse if there's some automatic bounds
checking needed by Java that we can figure out how to forego in C.

I'm waiting for some intuitive leap to come along and rescue me.

> Another thing you might want to add to Matcher is "approxCount()",
> which would make a rough guess at the possible number of matches, to
> help a future "matching optimizer" pick a strategy.

I can see us doing this within the core.  It could be handy when say, deciding
whether to invert a sparse filter and apply it using "next match" instead of
"next miss".

However, Matcher is supposed to be a public class and I would be reluctant to
add approxCount() as a public method.  Since Lucy cannot assume that its
target platforms support sane deprecation, once we add a public method it's
there forever.  Therefore, we need to be conservative about exposing public
methods in general, and extremely conservative about exposing methods whose
sole purpose is optimization.

Marvin Humphrey

View raw message