lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-2215) paging collector
Date Tue, 23 Mar 2010 20:35:27 GMT


Shai Erera commented on LUCENE-2215:

I've reviewed and the first thing I have to say about it is that I really
like it ! :) Saves lots of unnecessary heapify code, if the application can allow itself to
store the lowest last SD.

I have few comments/questions.

I don't understand what getLastScoreDoc is for? Is it just a utility method? Is it something
the app can compute by itself? Anyway, it lacks javadocs, so perhaps if they existed I wouldn't
need to ask ;).

In collect(), there's the following code:
		} else if (score == previousPassLowest.score && doc <= previousPassLowest.doc)
			// if the scores are the same and the doc is less than or equal to
			// the
			// previous pass lowest hit doc then skip because this collector
			// favors
			// lower number documents.

I think there's a typo in the comment "favors lower number documents" .. while it seems to
prefer higher doc IDs? The way I understand it, irregardless of whether docs are collected
in/out of order, HitQueue ensures that when scores are equals, the lowest IDs are favored.
Thus the first round always keeps the lowest IDs among the docs whose scores match. The next
round will favor the docs whose IDs come next, and so forth ... am I right? (just clarifying
my understanding).
If that's the case, I think it'll be good if it's spelled out in the comment, and also mention
that it means that document has already been returned previously (like it's documented in
the previous 'if').

The last 'else' really looks like TSDC's out-of-order version, which makes me think whether
PagingCollector can be viewed as a filter on top of TSDC (and possibly even TopFieldCollector)?
So if a hit should be collected, it just calls super.collect? I realize though that a Collector
is a hotspot and we want to minimize 'if' let alone method call statements as much as possible.
But it just feels so strong that it should be a filter ... :). And you wouldn't need to specifically
handle in/out orderness ... and w/ the right design, it can also wrap a TFC or any other TDC
implementation ...

BTW, I've noticed that you don't track maxScore - is it assumed that the application stores
it from the first round? If so I'd document it, because the application needs to know it should
use TSDC the first round, and PagingCollector the second round.

Also, PagingCollector offers a ctor which does not force the application to pass in a ScoreDoc.
See my comment from above - it might be misleading, because if you use this collector right
from the very first search, you lose the maxScore tracking. I also don't see why it should
be allowed - if a dummy previousPassLowest ScoreDoc is used, collect() does a lot of unnecessary
'if's. I think this collector should be used only from the second round, and a single ctor
which forces a ScoreDoc to be passed would make more sense. If the application wishes to shoot
itself in the leg (performance-wise), it can pass a dummy SD itself.

> paging collector
> ----------------
>                 Key: LUCENE-2215
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>    Affects Versions: 2.4, 3.0
>            Reporter: Adam Heinz
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments:,,
> Somebody assign this to Aaron McCurry and we'll see if we can get enough votes on this
issue to convince him to upload his patch.  :)

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message