lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1575) Refactoring Lucene collectors (HitCollector and extensions)
Date Mon, 30 Mar 2009 12:06:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693743#action_12693743
] 

Shai Erera commented on LUCENE-1575:
------------------------------------

Ok I now understand better where score is used in TopFieldCollector ... It is used in a number
of places, two important are:
# Maintain maxScore for returning in TopFieldDocs.
# Passed to FieldComparator.compareBottom which takes a doc and score parameters. There is
one comparator RelevanceComparator which makes use of the score passed, however that's part
of the method signature.
# Passed to FieldComparator.copy - again used by RelevanceComparator only.
# Passed to updateBottom, which updates the score of the least element in the queue and then
calls adjustTop().

* Number 2, 3 and 4 can be resolved by adding a setScorer to FieldComparator (as empty implementation)
which TopFieldCollector will call in each collect() call, passing the Scorer that was given
to it in its setScorer.
* Then, we override that in RelevanceComparator, saving the Scorer and using it whenever the
score is needed. Of course we'll need to save the current score, so that we don't call score()
too many times for the same document.
* This eliminates the need to define on TopFieldCollector whether scores should be saved.
The reason is that the Sort parameter may include a SortField.SCORE field, which will invoke
the RelevanceComparator.

The question is what to do with maxScore? It is needed for TopDocs / TopFieldDocs. It may
also be important to know the maxScore of a query, even if you sort it by something which
is not a score.

Question is - if the steps above make sense, why should we do them at all? :)
Now the score is computed and passed on to every FieldComparator we received in Sort. Cleaning
the method signature means additional code overhead in RelevanceComparator. If we want to
compute maxScore as well, it means the score will be computed twice, once in collect() and
once in RelevanceComparator.

We can solve the double score() computation by using an internal ScoreCacheScorer which keeps
the score of the current document and returns it whenever score() is called, unless it's a
new document and then it delegates the call to the wrapped Scorer. TopFieldCollector can instantiate
it in setScorer.

But this looks quite a lot for cleaning a method signature, don't you think? Of course if
you can suggest how we somehow remove the maxScore computation, then it might be a good change,
since only if SortField.SCORE is used, will the score be computed.

> Refactoring Lucene collectors (HitCollector and extensions)
> -----------------------------------------------------------
>
>                 Key: LUCENE-1575
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1575
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>
> This issue is a result of a recent discussion we've had on the mailing list. You can
read the thread [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html].
> We have agreed to do the following refactoring:
> * Rename MultiReaderHitCollector to Collector, with the purpose that it will be the base
class for all Collector implementations.
> * Deprecate HitCollector in favor of the new Collector.
> * Introduce new methods in IndexSearcher that accept Collector, and deprecate those that
accept HitCollector.
> ** Create a final class HitCollectorWrapper, and use it in the deprecated methods in
IndexSearcher, wrapping the given HitCollector.
> ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, when we
remove HitCollector.
> ** It will remove any instanceof checks that currently exist in IndexSearcher code.
> * Create a new (abstract) TopDocsCollector, which will:
> ** Leave collect and setNextReader unimplemented.
> ** Introduce protected members PriorityQueue and totalHits.
> ** Introduce a single protected constructor which accepts a PriorityQueue.
> ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. These can
be used as-are by extending classes, as well as be overridden.
> ** Introduce a new topDocs(start, howMany) method which will be used a convenience method
when implementing a search application which allows paging through search results. It will
also attempt to improve the memory allocation, by allocating a ScoreDoc[] of the requested
size only.
> * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() and getTotalHits()
implementations as they are from TopDocsCollector. The class will also be made final.
> * Change TopFieldCollector to extend TopDocsCollector, and make the class final. Also
implement topDocs(start, howMany).
> * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, instead of TopScoreDocCollector.
Implement topDocs(start, howMany)
> * Review other places where HitCollector is used, such as in Scorer, deprecate those
places and use Collector instead.
> Additionally, the following proposal was made w.r.t. decoupling score from collect():
> * Change collect to accecpt only a doc Id (unbased).
> * Introduce a setScorer(Scorer) method.
> * If during collect the implementation needs the score, it can call scorer.score().
> If we do this, then we need to review all places in the code where collect(doc, score)
is called, and assert whether Scorer can be passed. Also this raises few questions:
> * What if during collect() Scorer is null? (i.e., not set) - is it even possible?
> * I noticed that many (if not all) of the collect() implementations discard the document
if its score is not greater than 0. Doesn't it mean that score is needed in collect() always?
> Open issues:
> * The name for Collector
> * TopDocsCollector was mentioned on the thread as TopResultsCollector, but that was when
we thought to call Colletor ResultsColletor. Since we decided (so far) on Collector, I think
TopDocsCollector makes sense, because of its TopDocs output.
> * Decoupling score from collect().
> I will post a patch a bit later, as this is expected to be a very large patch. I will
split it into 2: (1) code patch (2) test cases (moving to use Collector instead of HitCollector,
as well as testing the new topDocs(start, howMany) method.
> There might be even a 3rd patch which handles the setScorer thing in Collector (maybe
even a different issue?)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message