lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Is TopDocCollector's collect() implementation correct?
Date Tue, 24 Mar 2009 17:51:25 GMT
Shai Erera <serera@gmail.com> wrote:

> But what about cases like collectors chaining, extensions and
> running w/ several collectors? If each collector will need to
> request for the document's score, it might be computed over and over
> again. Consider a case for example, of a TopScoreDocCollector,
> running together w/ another collector that extracts information
> about the document, and uses the score to compute something (easy to
> write a collector which delegates the collect() call to each of
> them). Today, I could just call collect(doc, score) on each
> collector. But in the proposed way, I'd call collect(doc) and then
> each of them will need to request the score.

Yeah good point... we'd somehow need such "chaining collectors" to
simply return the score if getScore() had already been called?

> Perhaps we can introduce a collect(doc) on HitCollector which does not
> accept score, but keep the other collect? I am not sure if that's any
> better, because then the Lucene search code would need to decide to which
> collect method to call ...

Yeah not sure I like that... and it doesn't scale up well (to other
getXXX's that we may want to make optionally available from Scorer).

> Regarding the TopDocs interface (we should have a better name as
> TopDocs is already taken), my point was that just introducing an
> interface will not solve the problem. If HitCollector was an
> interface then yes, that would have solved the problem, as we could
> have introduced a new interface TopDocsCollector that extends
> HitCollector (the interface) and expose a topDocs() and
> getTotalHits() methods. But since HitCollector is an abstract class,
> this will not solve the problem.  The reason is that IndexSearcher
> accepts a HitCollector.

Actually, it was Nadav who first proposed the "read interface", to
solve the "there's no common way for reading its output" problem.
With an interface (say TopDocsOutput), then you could have some method
somewhere:

  renderResults(TopDocsOutput results)

and then any collector, independent of how it *collects* results,
could implement TopDocsOutput if appropriate.

> So the code I'd need to write is:
>
> HitCollector hc = new TopScoreDocCollector();
> searcher.search(query, hc);
> TopDocs td = ((TopDocsCollector) hc).topDocs();
>
> That case looks very strange and is completely not safe, as someone
> could change TopScoreDocCollector to not implement the
> TopDocsCollector interface, and I'd discover that only at runtime.

I agree one should rely as little as possible on runtime casting,
though we wouldn't make such a change to TopScoreDocCollector (it
breaks back compat).

> Instead, I could have written:
>
> TopDocsCollctor hc = new TopScoreDocCollector();
> searcher.search(query, hc);
> TopDocs td = hc.topDocs();
>
> If someone changes TSDC to not implement TDC, I'd get a compilation
> error and no runtime exceptions will be thrown.

I think we should aim to separate "collection", which is how Lucene
delivers hits to the collector you passed in, from "results delivery",
which is how your code talks to the collector to read back the
results.

> What I also wanted to achieve by introducing a TopDocsCollector
> abstract class is the shared implementation of topDocs() and
> getTotalHits(). All of the Top***DocCollector extensions can just
> implement collect() and use the base implementation of topDocs(). If
> there's a collector that needs a different implementation, like
> TopFieldCollector, it can override it.

Would TopDocsCollector subclass HitCollector or
MultiReaderHitCollector?

> Is my purpose better explained now?

Well... I understand the problem: we have a real bug.  If you subclass
TopDocColector or TopScoreDocCollector, and don't do your sorting by
score, then collect() does the wrong thing.  So we need to fix that.

We've gone through various options for fixing it, all of which have
various drawbacks (breaks back compat, possible performance hit,
etc.), so we're still trying to converge on the lesser evil.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message