Subject: Re: Future projects
From: Michael McCandless
To: java-dev@lucene.apache.org
Date: Thu, 2 Apr 2009 04:40:53 -0400

On Wed, Apr 1, 2009 at 7:05 PM, Jason Rutherglen wrote:
> Now that
> LUCENE-1516 is close to being committed, perhaps we can
> figure out the priority of other issues:
>
> 1. Searchable IndexWriter RAM buffer

I think the first priority is to get a good assessment of the
performance of the current implementation (from LUCENE-1516).

My initial tests are very promising: with a writer updating
(replacing random docs) at 50 docs/second on a full (3.2 M doc)
Wikipedia index, I was able to reopen the reader once per second
and do a large (> 500K results) search that sorts by date. The
reopen time was typically ~40 msec, and search time typically ~35
msec (though there were random spikes up to ~340 msec). Though,
these results were on an SSD (Intel X25M 160 GB).

We need more datapoints on the current approach, but this looks
likely to be good enough for starters. And since we can get it
into 2.9, hopefully it'll get some early usage and people will
report back to help us assess whether further performance
improvements are necessary.

If they do turn out to be necessary, I think before your step 1,
we should write small segments into a RAMDirectory instead of the
"real" directory. That's simpler than truly searching
IndexWriter's in-memory postings data.

> 2. Finish up benchmarking and perhaps implement passing
> filters to the SegmentReader level

What is "passing filters to the SegmentReader level"? EG as of
LUCENE-1483, we now ask a Filter for its DocIdSet once per
SegmentReader.

> 3. Deleting by doc id using IndexWriter

We need a clean approach for the "docIDs suddenly shift when a
merge is committed" problem for this...

Thinking more on this... I think one possible solution may be to
somehow expose IndexWriter's internal docID remapping code.
IndexWriter does delete by docID internally, and whenever a merge
is committed we stop-the-world (sync on IW) and go remap those
docIDs. If we somehow allowed the user to register a callback
that we could call when this remapping occurs, then the user's
code could carry the docIDs without them becoming stale.
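To make the callback idea concrete, here's a rough sketch of what it
might look like. Note this is purely hypothetical: Lucene has no such
API today, and the names below (RemapListener, DocIDRemapper,
PendingDeletes) are invented for illustration, not real IndexWriter
methods.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch only -- none of these types exist in Lucene.
 * Maps a pre-merge docID to its post-merge docID.
 */
interface DocIDRemapper {
  int remap(int oldDocID);
}

/**
 * Callback the user would register with IndexWriter, invoked (under
 * IW's lock, i.e. during the stop-the-world remap) after a merge
 * commits and docIDs have shifted.
 */
interface RemapListener {
  void onRemap(DocIDRemapper remapper);
}

/** Example user code carrying docIDs across merges without going stale. */
class PendingDeletes implements RemapListener {
  private final List<Integer> docIDs = new ArrayList<>();

  void add(int docID) {
    docIDs.add(docID);
  }

  int get(int i) {
    return docIDs.get(i);
  }

  @Override
  public void onRemap(DocIDRemapper remapper) {
    // Rewrite every held docID into the post-merge numbering.
    for (int i = 0; i < docIDs.size(); i++) {
      docIDs.set(i, remapper.remap(docIDs.get(i)));
    }
  }
}
```

The user would hold a PendingDeletes, and IndexWriter would call
onRemap whenever its internal remap runs, so the stored docIDs stay
valid against the writer's current view of the index.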
Or maybe we could make a class "PendingDocIDs", which you'd ask
the reader to give you, that holds docIDs and remaps them after
each merge. The problem is, IW internally always logically
switches to the current reader for any further docID deletion,
but the user's code may continue to use an old reader. So simply
exposing this remapping won't fix it... we'd need to somehow
track the genealogy (quite a bit more complex).

> With 1) I'm interested in how we will lock a section of the
> bytes for use by a given reader? We would not actually lock
> them, but we need to set aside the bytes such that, for example,
> if the postings grow, TermDocs iteration does not progress
> beyond its limits. Are there any modifications that are needed
> of the RAM buffer format? How would the term table be stored?
> We would not be using the current hash method?

I think the realtime reader would just store the maxDocID it's
allowed to search, and we would likely keep using the RAM format
now used.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org