Date: Wed, 24 Dec 2008 11:03:00 -0800
From: Marvin Humphrey
To: java-dev@lucene.apache.org
Subject: Re: Realtime Search
Message-ID: <20081224190300.GA23787@rectangular.com>

On Tue, Dec 23, 2008 at 11:02:56PM -0600, robert engels
wrote:

> Seems doubtful you will be able to do this without increasing the
> index size dramatically. Since it will need to be stored
> "unpacked" (in order to have random access), yet the terms are
> variable length - leading to using a maximum=minimum size for every
> term.

Wow. That's a spectacularly awful design. Its worst case -- one outlier
term, say, 1000 characters in length, in a field where the average term
length is in the single digits -- would explode the index size and incur
wasteful IO overhead, just as you say.

Good thing we've never considered it. :)

I'm hoping we can improve on this, but for now, we've ended up at a
two-file design for the term dictionary index.

  1) Stacked 64-bit file pointers.
  2) Variable length character and term info data, interpreted using a
     pluggable codec.

In the index at least, each entry would contain the full term text,
encoded as UTF-8. Probably the primary term dictionary would continue to
use string diffs.

That design offers no significant benefits other than those that flow from
compatibility with mmap: faster IndexReader open/reopen, and lower RAM usage
under multiple processes by way of buffer sharing. IO bandwidth
requirements and speed are probably a little better, but lookups on the
term dictionary index are not a significant search-time bottleneck.

Additionally, sort caches would be written at index time in three files,
and memory mapped as laid out in .

  1) Stacked 64-bit file pointers.
  2) Character data.
  3) Doc num to ord mapping.

Marvin Humphrey
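
For illustration only, here is a minimal sketch of how a two-file term
dictionary index of the kind described above might be laid out and searched.
It is written in Python for brevity and is not Lucene or Lucy code. The
function names (build_term_index, term_at, find_term) are hypothetical, the
term info data and pluggable codec are left out, and in-memory byte strings
stand in for the two memory-mapped files. The trailing sentinel pointer is
also an assumption of the sketch, not something the design above specifies.

```python
import struct

POINTER_SIZE = 8  # one 64-bit file pointer per term


def build_term_index(terms):
    """Build the two 'files': stacked 64-bit pointers and raw UTF-8 term data."""
    data = bytearray()
    offsets = []
    # Sort by encoded bytes so lookups can compare raw bytes directly.
    for encoded in sorted(t.encode("utf-8") for t in terms):
        offsets.append(len(data))
        data += encoded
    offsets.append(len(data))  # sentinel: end of the last term (sketch assumption)
    pointers = struct.pack(f"<{len(offsets)}q", *offsets)
    return pointers, bytes(data)


def term_at(pointers, data, i):
    """Random access: read two adjacent pointers, then slice the term bytes."""
    start, end = struct.unpack_from("<2q", pointers, i * POINTER_SIZE)
    return data[start:end]


def find_term(pointers, data, term):
    """Binary search over the fixed-width pointers; returns term number or -1."""
    target = term.encode("utf-8")
    num_terms = len(pointers) // POINTER_SIZE - 1
    lo, hi = 0, num_terms
    while lo < hi:
        mid = (lo + hi) // 2
        if term_at(pointers, data, mid) < target:
            lo = mid + 1
        else:
            hi = mid
    if lo < num_terms and term_at(pointers, data, lo) == target:
        return lo
    return -1


if __name__ == "__main__":
    terms = ["banana", "apple", "x" * 1000, "cherry"]
    pointers, data = build_term_index(terms)
    print(len(pointers), len(data), find_term(pointers, data, "cherry"))
```

In this sketch the fixed-width part costs 8 bytes per term however long the
term is, so the single 1000-character outlier adds only its own 1000 bytes
of character data. Padding every entry to the maximum length would instead
cost roughly 1000 bytes per term.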