Date: Sat, 6 May 2006 09:31:33 -0700
From: Marvin Humphrey
To: java-dev@lucene.apache.org
Subject: Re: bytecount as prefix
Message-ID: <20060506163133.GA16771@rectangular.com>

On Sat, May 06, 2006 at 05:11:02PM +0900, David Balmain wrote:
> Hi Marvin,
>
> Where are you with this? I also have a vested interest in seeing
> Lucene move to using byte counts.
> I was wondering if I could help out.
> Is the patch you pasted here the latest you have?

All I've added since then is debugging code. Including some last night.

As I mentioned in another thread, this is going to be a multi-stage
process. The goal of that first patch is to have Lucene using bytecounts
everywhere (except for TermVectors, just because it isn't strictly
necessary). Lucene will be slower after it is [fixed, completed and]
applied.

The next stage will involve finding optimizations to return Lucene to at
least its prior speed. The primary target is segment merger.

Looking ahead, it will be interesting to see how many advantages of
working with term text as bytestrings can be realized. Lazy loading of
fields should be an obvious winner. The cached .tii in TermInfosReader
could potentially occupy a lot less RAM if your text takes up less space
in UTF-8 than in chars. And it becomes theoretically possible to have
Lucene use an arbitrary encoding for character data in the index, rather
than only UTF-8.

The intended mechanics of that patch should be plain enough. I'm going to
take another crack at seeing what's wrong with it today. If somebody
beats me to a solution, I won't complain. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
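
To illustrate why a bytecount prefix helps lazy loading, here is a
minimal sketch. It is not the patch and not Lucene's actual file format:
the class and method names are hypothetical, and it assumes a fixed
4-byte length and standard UTF-8 purely to keep the example short. The
point is only that when the prefix counts bytes, a reader can seek past
a string without decoding it, whereas a char count forces it to decode
every character to find where the string ends.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: names and layout are assumptions, not Lucene code.
public class BytePrefixSketch {

    // Write each string as a 4-byte big-endian BYTE count followed by
    // its UTF-8 bytes.
    public static byte[] writeWithByteCount(String... values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String value : values) {
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            out.writeInt(utf8.length);
            out.write(utf8);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Skip the first `toSkip` strings without decoding them, then decode
    // and return the next one.
    public static String readAfterSkipping(byte[] data, int toSkip) {
        ByteBuffer in = ByteBuffer.wrap(data); // big-endian by default
        for (int i = 0; i < toSkip; i++) {
            int byteLength = in.getInt();
            // A pure seek: no character decoding is needed to find the end.
            in.position(in.position() + byteLength);
        }
        int byteLength = in.getInt();
        byte[] utf8 = new byte[byteLength];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // The first string is 10 chars but 12 bytes in UTF-8, so a char
        // count alone would not say how far to seek.
        byte[] data = writeWithByteCount("na\u00efve caf\u00e9", "second field");
        System.out.println(readAfterSkipping(data, 1)); // prints "second field"
    }
}
```

The same byte arrays also hint at the RAM point: a Java char is two
bytes, so mostly-ASCII text held as UTF-8 bytes needs roughly half the
space it would as chars.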