lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Is TopDocCollector's collect() implementation correct?
Date Thu, 26 Mar 2009 10:55:18 GMT
Shai Erera <> wrote:

>> The difference is for the new code, it's an upcast, which catches any
>> errors at compile time, not run time.  The compiler determines that
>> the class implements the required interface.
> I still don't understand how a compiler can detect at compilation time that
> a HitCollector instance that is used in method A, and is casted to a
> TopDocsOutput instance by calling method B (from A) is actually ok ... I may
> be missing some Java information here, but I simply don't see how that
> happens at compilation time instead of at runtime ...

I may have lost the context here... but here's what I thought we were
talking about.

If we choose the interface option (adding a "ProvidesTopDocsResults"
(say) interface), then you would create method

Then, any collector implementing that interface could be safely passed
in, as the upcast is done at compile time not runtime.

> So in fact the internal Lucene code expects only MRHC from a certain point,
> and so even if I wrote a HC and passed it on Searcher, it's still converted
> to MRHC, with an empty setNextReader() method implementation. That's why I
> meant that HC is already deprecated, whether it's marked as deprecated or
> not.

The setNextReader() impl is not empty; it does the re-basing of docID
on behalf of the HC.

> What you say about deprecating HC to me is unnecessary. Simply pull
> setNextReader up with an empty implementation, get rid of all the
> instanceof, casting and wrapping code and you're fine. Nothing is broken.
> All works well and better (instanceof, casting and wrapping have their
> overheads).
> Isn't that right?

I think we need to deprecate HC, in favor of MRHC (or if we can think
of a better name... ResultCollector?).

> Regarding interfaces .. I don't think they're that bad. Perhaps a different
> viewing angle might help. Lucene processes a query and fires events. Today
> it fires an event every time a doc's score has been computed, and recently
> every time it starts to process a different reader. HitCollector is a
> listener implementation on the "doc-score event", while MRHC is a listener
> on both.
> To me, interfaces play just nicely here. Assume that you have the following
> interfaces:
> - DocScoreEvent
> - ChangeReaderEvent
> - EndProcessingEvent (thrown when the query has finished processing -
> perhaps to aid collectors to free resources)
> - <any other events you foresee?>
> The Lucene code receives a HitCollector which listens on all events. In the
> future, Lucene might throw other events, but still get a HitCollector. Those
> methods will check for instanceof, and you as a user will know that if you
> want to catch those events, you pass in a collector implementation which
> does. Those events cannot of course be main-stream events (like
> DocScoreEvent), but new ones, perhaps even experimental.
> Since HitCollector is a concrete class, we can always add new interfaces to
> it in the future with empty implementations?

I agree interfaces clearly have useful properties, but their achilles
heel for Lucene in all but the most trivial needs is the
non-back-compatibility when you want to add a method to the interface.
There have been a number of java-dev discussons on this problem.

So, I think something like this:

  * Deprecate HitCollector in favor of MultiReaderHitCollector (any
    better name here?) abstract class.  If you want to make a fully
    custom HitCollector, you subclass this calss.

    Let's change MRHC's collect to take only an [unbased] docID, and
    expose a setScorer(Scorer) method.  Then if the collector needs
    score it can call Scorer.score().

  * Subclass that to create an abstract "tracks top N results
    collector" (TopDocsCollector?  TopHitsCollector?)

  * Subclass TopDocsCollector to a final, fast "top N sorted by score"
    collector (exists already: TopScoreDocCollector)

  * Subclass TopDocsCollector to a final, fast "top N sorted by field"
    collector (exists already: TopFieldCollector)

  * Subclass TopDocsCollector to a "you provide your own pqueue and we
    collect top N according to it" collector (does not yet exist --
    name?).  This is the way forward for existing subclasses of

Shai do you want to take a first cut at making a patch?  Can you open
an issue?  Thanks.


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message