lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Is TopDocCollector's collect() implementation correct?
Date Thu, 26 Mar 2009 14:23:27 GMT
Shai Erera <> wrote:
>> I may have lost the context here... but here's what I thought we were
>> talking about.
>> If we choose the interface option (adding a "ProvidesTopDocsResults"
>> (say) interface), then you would create method
>> renderResults(ProvidesTopDocResults).
>> Then, any collector implementing that interface could be safely passed
>> in, as the upcast is done at compile time not runtime.
> Consider this code snippet:
> HitCollector hc = condition? new TopDocCollector() : TopFieldDocCollector();
> The problem is that I need a base class for both collectors. If I use the
> interface ProvidesTopDocsResults, then I cannot pass it to searcher, w/o
> casting to HitCollector. If I use HitCollector, then I need to cast it
> before passing it into rederResults(). Only when both class have the same
> base class which is also a HitCollector, I don't need any casting. I.e.,
> after I submit a patch that develops what we've agreed on, the code can look
> like this:
> TopResultsCollector trc = condition ? new TopScoreDocCollector() : new
> TopFieldCollector();
> renderResults(trc);
> Here I can pass 'trc' to both methods since it both a HitCollector and a
> TopResultsCollector. That's what I was missing in your proposal.

OK I agree.  And with our proposed changes (TopResultsCollector), you
can do this.

>> Shai do you want to take a first cut at making a patch?  Can you open
>> an issue?  Thanks.
> I can certainly do that. I think the summary of the steps make sense. I'll
> check if TopScoreDocCollector and TopFieldCollector can also extend that
> "you provide your own pqueue and we collect top N according to it"
> collector, passing a null PQ and extending topDocs().

OK, thanks.

> I also would like to propose an additional method to topDocs(), topDocs(int
> start, int howMany) which will be more efficient to call in case of paging
> through results. The reason is that topDocs() pops everything from the PQ,
> then allocates a ScoreDoc[] array of size of number of results and returns a
> TopDocs object. You can then choose just the ones you need.
> On the other hand, topDocs(start, howMany) would allocate exactly the size
> of array needed. E.g., in case someone pages through results 91-100, you
> allocate an array of size 10, instead of 100.
> It is not a huge improvement, but it does save some allocations, as well as
> it's a convenient method.

Though... this is somewhat tricky to implement efficiently when using
pqueue: you pop the worst scoring hit first, then next worst scoring,
etc, into an array (in reverse order).  It would be conceivable to do
a [separate] partial sort of the queue to more efficiently retrieve a
top subset of N to save some time on the extraction.  But my guess is
extraction time is trivial; I don't think we need to optimize it.

That being said, we could make the API like this, but under the hood
simply do what we do today the first time it's called, leaving as a
future optimization to speed it up.

Alternatively we could make a ScoreDoc result(int n) to retrieve each
result one by one... or maybe doc(int n) and score(int n) since some
collectors won't score (but, then we'd need to handle FieldDoc, which
is used to more generically return the sort field values for each

> BTW, I like the name ResultsCollector, as it's just like HitCollector, but
> does not commit too much to "hits" .. i.e., facets aren't hits ... I think?

I like it too!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message