lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shai Erera (JIRA)" <>
Subject [jira] [Updated] (LUCENE-3102) Few issues with CachingCollector
Date Mon, 16 May 2011 12:00:48 GMT


Shai Erera updated LUCENE-3102:

    Attachment: LUCENE-3102.patch

Patch includes the bug fixes + test. Still none of the items I listed after 'Also ...'. I
plan to tackle that next, in subsequent patches.

Question -- perhaps we can commit these changes incrementally? I.e., after we iterate on the
changes in this patch, if they are ok, commit them, then do the rest of the stuff? Or a single
commit w/ everything is preferable?

Mike, there is another reason to separate Collector.needsScores() from cacheScores -- it is
possible someone will pass a Collector which needs scores, however won't want to have CachingCollector
'cache' them. In which case, the wrapped Collector should be delegated setScorer instead of

I will leave Collector.needsScores() for a different issue though?

> Few issues with CachingCollector
> --------------------------------
>                 Key: LUCENE-3102
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/*
>            Reporter: Shai Erera
>            Priority: Minor
>             Fix For: 3.2, 4.0
>         Attachments: LUCENE-3102.patch
> CachingCollector (introduced in LUCENE-1421) has few issues:
> # Since the wrapped Collector may support out-of-order collection, the document IDs cached
may be out-of-order (depends on the Query) and thus replay(Collector) will forward document
IDs out-of-order to a Collector that may not support it.
> # It does not clear cachedScores + cachedSegs upon exceeding RAM limits
> # I think that instead of comparing curScores to null, in order to determine if scores
are requested, we should have a specific boolean - for clarity
> # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed?
E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint,
but if it was 10K I would? Wouldn't we still want to try and cache them?
> Also:
> * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that
if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined
explicitly than implicitly?
> * Let's introduce a factory method for creating a specialized version if scoring is requested
/ not (i.e., impl the TODO in line 189)
> * I think it's a useful collector, which stands on its own and not specific to grouping.
Can we move it to core?
> * How about using OpenBitSet instead of int[] for doc IDs?
> ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries
> ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that
if the wrapped Collector does not support out-of-order
> * Do you think we can modify this Collector to not necessarily wrap another Collector?
We have such Collector which stores (in-memory) all matching doc IDs + scores (if required).
Those are later fed into several processes that operate on them (e.g. fetch more info from
the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector
and then someone can reuse it by setting RAM limit to unlimited (we should have a constant
for that) in order to simply collect all matching docs + scores.
> * I think a set of dedicated unit tests for this class alone would be good.
> That's it so far. Perhaps, if we do all of the above, more things will pop up.

This message is automatically generated by JIRA.
For more information on JIRA, see:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message