Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3B7D06093 for ; Mon, 16 May 2011 15:32:30 +0000 (UTC) Received: (qmail 31216 invoked by uid 500); 16 May 2011 15:32:29 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 31165 invoked by uid 500); 16 May 2011 15:32:29 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 31158 invoked by uid 99); 16 May 2011 15:32:29 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 May 2011 15:32:29 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=5.0 tests=ALL_TRUSTED,T_RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 May 2011 15:32:26 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id 4BD81CCC14 for ; Mon, 16 May 2011 15:31:47 +0000 (UTC) Date: Mon, 16 May 2011 15:31:47 +0000 (UTC) From: "Shai Erera (JIRA)" To: dev@lucene.apache.org Message-ID: <596216898.15487.1305559907292.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <2099993424.13982.1305473687376.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Updated] (LUCENE-3102) Few issues with CachingCollector MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/LUCENE-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-3102: ------------------------------- Attachment: LUCENE-3102.patch bq. Only thing is: I would be careful about directly setting those private fields of the cachedScorer; I think (not sure) this incurs an "access" check on each assignment. Maybe make them package protected? Or use a setter? Good catch Mike. I read about it some and found this nice webpage which explains the implications (http://www.glenmccl.com/jperf/). Indeed, if the member is private (whether it's in the inner or outer class), there is an access check. So the right think to do is to declare is protected / package-private, which I did. Thanks for the opportunity to get some education ! Patch fixes this. I intend to commit this shortly + move the class to core + apply to trunk. Then, I'll continue w/ the rest of the improvements. > Few issues with CachingCollector > -------------------------------- > > Key: LUCENE-3102 > URL: https://issues.apache.org/jira/browse/LUCENE-3102 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/* > Reporter: Shai Erera > Priority: Minor > Fix For: 3.2, 4.0 > > Attachments: LUCENE-3102.patch, LUCENE-3102.patch > > > CachingCollector (introduced in LUCENE-1421) has few issues: > # Since the wrapped Collector may support out-of-order collection, the document IDs cached may be out-of-order (depends on the Query) and thus replay(Collector) will forward document IDs out-of-order to a Collector that may not support it. > # It does not clear cachedScores + cachedSegs upon exceeding RAM limits > # I think that instead of comparing curScores to null, in order to determine if scores are requested, we should have a specific boolean - for clarity > # This check "if (base + nextLength > maxDocsToCache)" (line 168) can be relaxed? E.g., what if nextLength is, say, 512K, and I cannot satisfy the maxDocsToCache constraint, but if it was 10K I would? Wouldn't we still want to try and cache them? > Also: > * The TODO in line 64 (having Collector specify needsScores()) -- why do we need that if CachingCollector ctor already takes a boolean "cacheScores"? I think it's better defined explicitly than implicitly? > * Let's introduce a factory method for creating a specialized version if scoring is requested / not (i.e., impl the TODO in line 189) > * I think it's a useful collector, which stands on its own and not specific to grouping. Can we move it to core? > * How about using OpenBitSet instead of int[] for doc IDs? > ** If the number of hits is big, we'd gain some RAM back, and be able to cache more entries > ** NOTE: OpenBitSet can only be used for in-order collection only. So we can use that if the wrapped Collector does not support out-of-order > * Do you think we can modify this Collector to not necessarily wrap another Collector? We have such Collector which stores (in-memory) all matching doc IDs + scores (if required). Those are later fed into several processes that operate on them (e.g. fetch more info from the index etc.). I am thinking, we can make CachingCollector *optionally* wrap another Collector and then someone can reuse it by setting RAM limit to unlimited (we should have a constant for that) in order to simply collect all matching docs + scores. > * I think a set of dedicated unit tests for this class alone would be good. > That's it so far. Perhaps, if we do all of the above, more things will pop up. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org