lucene-dev mailing list archives

From Michael McCandless <>
Subject Re: Is TopDocCollector's collect() implementation correct?
Date Wed, 25 Mar 2009 13:38:14 GMT
This stuff is surprisingly hard to think about!

>> Actually, it was Nadav who first proposed the "read interface", to
>> solve the "there's no common way for reading its output" problem.
>> With an interface (say TopDocsOutput), then you could have some
>> method somewhere:
>>  renderResults(TopDocsOutput results)
>> and then any collector, independent of how it *collects* results,
>> could implement TopDocsOutput if appropriate.
> You'd still need to cast the collector to TopDocsOutput, won't you?
> How's that different than the code snippet I showed above?

The difference is that in the new code it's an upcast, so any error is
caught at compile time, not run time.  The compiler verifies that the
class implements the required interface.
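
To make the upcast point concrete, here is a minimal sketch.  Only the
names TopDocsOutput and renderResults come from this thread; the method
signatures, SimpleCollector and UpcastDemo are assumptions made up for
illustration and are not Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical read-side interface, named after the TopDocsOutput idea above.
interface TopDocsOutput {
    int getTotalHits();
    List<Integer> topDocIds();
}

// Simplified stand-in for a collector; not Lucene's HitCollector API.
class SimpleCollector implements TopDocsOutput {
    private final List<Integer> docs = new ArrayList<>();

    void collect(int doc, float score) {
        docs.add(doc); // keeps every hit; ranking is omitted for brevity
    }

    @Override public int getTotalHits() { return docs.size(); }
    @Override public List<Integer> topDocIds() { return docs; }
}

class UpcastDemo {
    // Accepts anything that can deliver results, however it collected them.
    static String renderResults(TopDocsOutput results) {
        return results.getTotalHits() + " hits: " + results.topDocIds();
    }

    public static void main(String[] args) {
        SimpleCollector c = new SimpleCollector();
        c.collect(3, 1.0f);
        c.collect(7, 0.5f);

        // Implicit upcast: the compiler checks that SimpleCollector
        // implements TopDocsOutput, so a mistake here would not compile.
        System.out.println(renderResults(c));

        // A downcast, by contrast, compiles but can fail at run time.
        Object notACollector = "oops";
        try {
            TopDocsOutput bad = (TopDocsOutput) notACollector;
            System.out.println(bad.getTotalHits());
        } catch (ClassCastException e) {
            System.out.println("downcast failed at run time");
        }
    }
}
```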

> The current situation introduces a bug, that's true. However, unless
> something better pops up, shouldn't we just make it final?

But that leaves no way forward for current users subclassing
TopDocCollector (for the freedom of providing your own pqueue).

> May I suggest something else? What if MRHC was actually an
> interface?

I think an interface is too dangerous in this case (the future
back-compatibility problem).  EG here we want to explore a way to
not pre-compute the score; had we released MRHC as an interface we'd
be in trouble.  (We may still be in trouble anyway!)
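
A rough sketch of why an abstract class is safer here: a later release
can add a method with a default body and existing subclasses keep
compiling, whereas adding a method to an interface (which cannot carry
a method body in the Java versions in use today) breaks every
implementer.  All names below (BaseCollector, CountingCollector,
BackCompatDemo) are hypothetical.

```java
// Hypothetical "second release" of a collector base class.
abstract class BaseCollector {
    abstract void collect(int doc, float score);

    // Added after the first release.  The no-op default means subclasses
    // written against the first release still compile unchanged.
    void setNextReader(int docBase) { }
}

// A user subclass written before setNextReader existed.
class CountingCollector extends BaseCollector {
    int count;

    @Override void collect(int doc, float score) {
        count++;
    }
}

class BackCompatDemo {
    public static void main(String[] args) {
        CountingCollector c = new CountingCollector();
        c.setNextReader(0);   // inherited no-op from the base class
        c.collect(5, 1.0f);
        System.out.println(c.count);
    }
}
```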

>> Would TopDocsCollector subclass HitCollector or
>> MultiReaderHitCollector?
> Well ... we've been there as well already :). I don't think there's
> an easy answer here. I guess if MRHC is the better approach, and we
> think all Top***DocCollector would want to have the MRHC
> functionality, then I'd say let's extend MRHC. Otherwise, I don't
> have a good answer. When I started this thread, I only knew of
> HitCollector, so things were simpler at the time.

We have challenging goals here:

  * The "collect top N by score" collector should be final, use
    ScorerDocQueue, specialized to sorting by score/docID: performance
    is important.

  * Likewise for the "collect top N by sorted field" collector, though
    it does provide extensibility by letting you make a custom
    comparator (FieldComparatorSource).  Ideally this would work both
    with and without computing the score (it does not today).

  * A "top N by my own pqueue" collector (this is what
    TopDocCollector/TopScoreDocsCollector allow today, but it has the

  * Allow fully custom collection, with and without score.

Maybe we should in fact simply deprecate HitCollector (in favor of
MultiReaderHitCollector)?  After all, making your own HitCollector is
an advanced thing; expecting you to properly implement setNextReader
may be fine.

And then we can subclass MultiReaderHitCollector to TopDocsCollector
(which adds the totalHits/topDocs "results delivery" API).

And then the "collect top docs by score", and "collect top docs by
fields" collectors subclass TopDocsCollector?

Finally, we add a "collect top docs according to my own pqueue"

Then we wouldn't need an interface; this works because all core
collectors deliver top N results in the end.
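
The hierarchy proposed above could be sketched roughly as follows.
This is a simplified illustration under stated assumptions, not a
patch: the method signatures are invented (setNextReader here takes
only a doc base), ByScoreCollector and Hit are hypothetical names, and
java.util.PriorityQueue stands in for Lucene's own priority queue.
Ties are broken in favor of the lower doc ID, which is also an
assumption.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Simplified stand-in for MultiReaderHitCollector.
abstract class MultiReaderHitCollector {
    abstract void setNextReader(int docBase);
    abstract void collect(int doc, float score);
}

// Adds the totalHits/topDocs "results delivery" API.
abstract class TopDocsCollector extends MultiReaderHitCollector {
    protected int totalHits;

    int getTotalHits() { return totalHits; }

    abstract int[] topDocs(); // doc IDs, best hit first
}

final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) { this.doc = doc; this.score = score; }
}

// Hypothetical final "collect top N by score" collector.
final class ByScoreCollector extends TopDocsCollector {
    // Orders weakest hit first: lower score, then higher doc ID on ties.
    private static final Comparator<Hit> WEAKEST_FIRST = (a, b) ->
        a.score != b.score ? Float.compare(a.score, b.score)
                           : Integer.compare(b.doc, a.doc);

    private final PriorityQueue<Hit> pq = new PriorityQueue<>(WEAKEST_FIRST);
    private final int numHits;
    private int docBase;

    ByScoreCollector(int numHits) { this.numHits = numHits; }

    @Override void setNextReader(int docBase) { this.docBase = docBase; }

    @Override void collect(int doc, float score) {
        totalHits++;
        pq.add(new Hit(docBase + doc, score)); // rebase to a global doc ID
        if (pq.size() > numHits) {
            pq.poll(); // drop the weakest hit
        }
    }

    @Override int[] topDocs() {
        Hit[] hits = pq.toArray(new Hit[0]);
        Arrays.sort(hits, WEAKEST_FIRST.reversed()); // best first
        int[] docs = new int[hits.length];
        for (int i = 0; i < hits.length; i++) {
            docs[i] = hits[i].doc;
        }
        return docs;
    }
}

class CollectorHierarchyDemo {
    // No read interface needed: the base class already delivers results.
    static String renderResults(TopDocsCollector results) {
        return results.getTotalHits() + " total, top: "
            + Arrays.toString(results.topDocs());
    }

    public static void main(String[] args) {
        ByScoreCollector c = new ByScoreCollector(2);
        c.setNextReader(0);
        c.collect(0, 1.0f);
        c.collect(1, 3.0f);
        c.setNextReader(10);
        c.collect(0, 2.0f);
        System.out.println(renderResults(c));
    }
}
```

A "top docs by fields" collector or a "my own pqueue" collector would
be further subclasses of TopDocsCollector in this sketch, and
renderResults would accept them without any cast.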

All that's missing is a way to NOT compute score if it's not needed.

