Date: Sun, 28 Aug 2005 20:42:26 -0700
To: java-dev@lucene.apache.org
From: Ken Krugler
Subject: Re: Lucene does NOT use UTF-8

>I'm not familiar with UTF-8 enough to follow the details of this
>discussion. I hope other Lucene developers are, so we can resolve this
>issue.... anyone raising a hand?

I could, but recent posts make me think this is heading towards a
religious debate :)

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to
be used by other implementations besides the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8
strings.

c. The hard(er) part would be backwards compatibility with older
indexes. I haven't looked at this enough to really know, but one
example is the compound file (xx.cfs) format...I didn't see a version
number, and it contains strings.

d. The documentation could be clearer on what is meant by the "string
length", but this is a trivial change.

What's unclear to me (not being a Perl, Python, etc. jock) is how much
easier it would be to get these other implementations working with
Lucene, following a change to UTF-8. So I can't comment on the return
on time required to change things.

I'm also curious about the existing CLucene & PyLucene ports. Would
they also need to be similarly modified, with the proposed changes?

One final point. I doubt people have been adding strings with embedded
nulls, and text outside of the Unicode BMP is also very rare. So _most_
Lucene indexes only contain valid UTF-8 data. It's only the above two
edge cases that create an interoperability problem.

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
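
The two edge cases named in the final point (embedded nulls and text outside the BMP) can be illustrated with a minimal Java sketch. It is not Lucene's own string-writing code. It uses the JDK's `DataOutputStream.writeUTF`, whose "modified UTF-8" differs from standard UTF-8 in those same two cases, as a stand-in. The class and method names (`ModifiedUtf8Demo`, `modifiedUtf8`, `standardUtf8`, `sameEncoding`) are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; a stand-in for illustration, not Lucene code.
final class ModifiedUtf8Demo {

    // Encode with the JDK's modified UTF-8 and drop the 2-byte length
    // prefix that writeUTF puts in front of the encoded bytes.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Encode with standard (conformant) UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True when both encodings produce byte-for-byte identical output.
    static boolean sameEncoding(String s) {
        return Arrays.equals(modifiedUtf8(s), standardUtf8(s));
    }

    public static void main(String[] args) {
        String plainBmp = "caf\u00e9";       // ordinary BMP text
        String withNull = "a\u0000b";        // embedded null (U+0000)
        String outsideBmp = "\uD834\uDD1E";  // U+1D11E, stored as a surrogate pair

        for (String s : new String[] {plainBmp, withNull, outsideBmp}) {
            System.out.println("modified=" + modifiedUtf8(s).length
                    + " bytes, standard=" + standardUtf8(s).length
                    + " bytes, identical=" + sameEncoding(s));
        }
    }
}
```

Ordinary BMP text encodes identically either way. The embedded null becomes two bytes instead of one, and the character outside the BMP becomes six bytes (two three-byte surrogates) instead of four, so a strict UTF-8 decoder in another implementation would reject or misread only those strings.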