lucene-dev mailing list archives

From Michael McCandless
Subject Re: Is TopDocCollector's collect() implementation correct?
Date Mon, 23 Mar 2009 18:56:15 GMT
Shai Erera wrote:

> As a side comment, why not add setNextReader to HitCollector and
> then a getDocId(int doc) method which will do the doc + base
> arithmetic?

One problem is this breaks back compatibility on any current
subclasses of HitCollector.

Another problem is: not all collectors would need to add the base on
each doc.  EG a collector that puts hits into separate pqueues per
segment could defer the addition until the end when only the top
results are pulled out of each pqueue.
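
As a rough illustration of that deferral, here is a minimal sketch in
plain Java.  It is not Lucene code: PerSegmentTopCollector, Hit and
setNextSegment are hypothetical names, and the only point is that
collect() does no base arithmetic, while topDocs() adds the base just
for the hits that survive in each per-segment queue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch; these names are illustrative and not Lucene's API.
class PerSegmentTopCollector {
    /** A hit: a doc id and its score. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int topN;
    private final List<Integer> bases = new ArrayList<>();
    private final List<PriorityQueue<Hit>> queues = new ArrayList<>();
    private PriorityQueue<Hit> current;

    PerSegmentTopCollector(int topN) { this.topN = topN; }

    /** Called once per segment; the base is remembered, not applied. */
    void setNextSegment(int docBase) {
        // Min-heap on score, so the weakest retained hit is at the head.
        current = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        bases.add(docBase);
        queues.add(current);
    }

    /** Hot path: works on segment-relative doc ids only. */
    void collect(int doc, float score) {
        if (current.size() < topN) {
            current.add(new Hit(doc, score));
        } else if (score > current.peek().score) {
            current.poll();
            current.add(new Hit(doc, score));
        }
    }

    /** Merges the queues, adding each base only to the retained hits. */
    List<Hit> topDocs() {
        List<Hit> all = new ArrayList<>();
        for (int i = 0; i < queues.size(); i++) {
            int base = bases.get(i);
            for (Hit h : queues.get(i)) {
                all.add(new Hit(base + h.doc, h.score));
            }
        }
        all.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(all.subList(0, Math.min(topN, all.size())));
    }
}
```

With N segments this performs at most N * topN additions in total,
instead of one per collected hit.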

Also, I am concerned about the method call overhead.  This is the
absolute ultimate hot spot for Lucene and we should worry about
causing even a single added instruction in this path.

That said... I would like to [eventually] change the collection API
along the lines of what Marvin proposed for "Matcher" in Lucy, here:

Specifically, I think it should be the collector's job to ask for the
score for this doc, rather than Lucene's job to pre-compute it, so
that collectors that don't need the score won't waste CPU.  EG, if you
are sorting by field (and don't present the relevance score) you
shouldn't compute it.
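
A minimal sketch of that "pull" style, assuming nothing about the
eventual API: ScoreSource, PullCollector and the classes below are
hypothetical names.  The collector is handed something it can ask for
the score, and a collector that sorts by a field simply never asks.

```java
// Hypothetical sketch; these types are illustrative and not Lucene's API.

/** Something the collector can ask for the current doc's score, on demand. */
interface ScoreSource {
    float score();
}

/** Collector that receives only the doc id and pulls the score if it wants it. */
interface PullCollector {
    void setScoreSource(ScoreSource source);
    void collect(int doc);
}

/** Sort-by-field style: keeps the doc with the smallest field value. */
class MinFieldCollector implements PullCollector {
    private final int[] fieldValues;   // assumed per-doc sort values
    int bestDoc = -1;

    MinFieldCollector(int[] fieldValues) { this.fieldValues = fieldValues; }

    public void setScoreSource(ScoreSource source) { /* score is not needed */ }

    public void collect(int doc) {
        if (bestDoc == -1 || fieldValues[doc] < fieldValues[bestDoc]) {
            bestDoc = doc;
        }
    }
}

/** Relevance style: pulls the score for every hit. */
class MaxScoreCollector implements PullCollector {
    private ScoreSource source;
    float maxScore = Float.NEGATIVE_INFINITY;

    public void setScoreSource(ScoreSource source) { this.source = source; }

    public void collect(int doc) {
        maxScore = Math.max(maxScore, source.score());
    }
}

/** Stand-in for real scoring work; counts how often a score is computed. */
class CountingScoreSource implements ScoreSource {
    int calls;
    public float score() { calls++; return calls; }
}
```

Driving MinFieldCollector over any number of docs leaves
CountingScoreSource.calls at zero, which is the CPU saving described
above.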

Then, we could add other "somewhat expensive" things you might
retrieve, such as a way to ask which terms participated in the match
(discussed today on java-user), and/or all term positions that
participated (discussed in LUCENE-1522).  EG, a top doc collector
could choose to call these methods only when the doc was competitive.
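
The same idea extends to the more expensive lookups.  In the sketch
below, MatchDetails and matchedTerms() are hypothetical stand-ins for
whatever such an API might look like; the collector asks for the
matched terms only when the doc beats the best hit seen so far.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; these names are illustrative and not Lucene's API.

/** Per-doc details the collector may pull; matchedTerms is assumed costly. */
interface MatchDetails {
    float score();
    List<String> matchedTerms();
}

/** Keeps the single best hit and fetches its terms only when competitive. */
class BestHitWithTermsCollector {
    private MatchDetails details;
    int bestDoc = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    List<String> bestTerms = Collections.emptyList();

    void setMatchDetails(MatchDetails details) { this.details = details; }

    void collect(int doc) {
        float score = details.score();
        if (score > bestScore) {
            bestScore = score;
            bestDoc = doc;
            // Expensive call, made only for docs that are competitive.
            bestTerms = details.matchedTerms();
        }
    }
}
```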

> Anyway, I don't want to add topDocs and getTotalHits to
> HitCollector, it will destroy its generic purpose.

I agree.

> An interface is also problematic, as it just means all of these
> collectors have these methods declared, but they need to implement
> them. An abstract class grants you w/ both.

I'm confused by this objection: only collectors that let you ask for
the top N docs would implement this interface, no?  (Ie, only the
TopXXXCollectors would implement it.)  While interfaces clearly carry
a future back-compatibility problem, this case may be simple enough to
make an exception.
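
To make the shape concrete, here is a minimal sketch.  Only the method
names topDocs and getTotalHits come from this thread; BasicCollector,
TopResults and the two implementations are hypothetical.  The generic
base class stays a single callback, and only the collector that keeps
top results opts in to the interface.

```java
// Hypothetical sketch; only the method names topDocs and getTotalHits come
// from the discussion, everything else is illustrative.

/** The simple, generic collector: one callback and nothing else. */
abstract class BasicCollector {
    abstract void collect(int doc, float score);
}

/** Optional capability, implemented only by collectors that keep top results. */
interface TopResults {
    int getTotalHits();
    int[] topDocs();   // doc ids, best first
}

/** A generic collector that does not keep top results. */
class HitCounter extends BasicCollector {
    int count;
    @Override void collect(int doc, float score) { count++; }
}

/** A top-N style collector (N = 1 here to keep the sketch short). */
class TopOneCollector extends BasicCollector implements TopResults {
    private int totalHits;
    private int bestDoc = -1;
    private float bestScore = Float.NEGATIVE_INFINITY;

    @Override void collect(int doc, float score) {
        totalHits++;
        if (score > bestScore) {
            bestScore = score;
            bestDoc = doc;
        }
    }

    @Override public int getTotalHits() { return totalHits; }

    @Override public int[] topDocs() {
        return bestDoc < 0 ? new int[0] : new int[] { bestDoc };
    }
}
```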

> So it looks like HitCollector itself is "deprecated" as far as the
> Lucene core code sees it.

I think HitCollector has a purpose, which is to be the simplest way to
make a custom collector.  Ie I think it makes sense to offer a simple
way and a high performance way.
