Message-Id: <p0623092bbf3832221cfe@[192.168.1.42]>
In-Reply-To: <20050829032155.82149.qmail@web31110.mail.mud.yahoo.com>
References: <20050829032155.82149.qmail@web31110.mail.mud.yahoo.com>
Date: Sun, 28 Aug 2005 20:42:26 -0700
To: java-dev@lucene.apache.org
From: Ken Krugler <kkrugler@transpac.com>
Subject: Re: Lucene does NOT use UTF-8
Content-Type: text/plain; charset="us-ascii" ; format="flowed"

>I'm not familiar with UTF-8 enough to follow the details of this
>discussion.  I hope other Lucene developers are, so we can resolve this
>issue.... anyone raising a hand?

I could, but recent posts make me think this is heading towards a 
religious debate :)

I think the following statements are all true:

a. Using UTF-8 for strings would make it easier for Lucene indexes to 
be used by other implementations besides the reference Java version.

b. It would be easy to tweak Lucene to read/write conformant UTF-8 strings.

c. The hard(er) part would be backwards compatibility with older 
indexes. I haven't looked at this enough to really know, but one 
example is the compound file (xx.cfs) format...I didn't see a version 
number, and it contains strings.

d. The documentation could be clearer on what is meant by the "string 
length", but this is a trivial change.
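
To illustrate why (d) matters, here is a minimal sketch of three different numbers that "string length" could plausibly mean for the same Java string. The class and method names are made up for this example and are not part of Lucene. My understanding from this thread is that the stored count is the number of Java chars, but treat that as an assumption rather than a statement about the file format.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not Lucene code.
class StringLengthDemo {
    // Number of Java chars, i.e. UTF-16 code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in conformant UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', e-acute, and U+1D11E (outside the BMP, stored as a surrogate pair)
        String s = "a\u00e9\uD834\uDD1E";
        System.out.println(utf16Units(s)); // 4
        System.out.println(codePoints(s)); // 3
        System.out.println(utf8Bytes(s));  // 7 (1 + 2 + 4)
    }
}
```

For pure ASCII all three numbers agree, which is probably why the ambiguity is easy to miss.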

What's unclear to me (not being a Perl, Python, etc. jock) is how much 
easier it would be to get these other implementations working with 
Lucene, following a change to UTF-8. So I can't comment on the return 
on the time required to change things.

I'm also curious about the existing CLucene & PyLucene ports. Would 
they need similar modifications under the proposed changes?

One final point. I doubt people have been adding strings with 
embedded nulls, and text outside of the Unicode BMP is also very 
rare. So _most_ Lucene indexes only contain valid UTF-8 data. It's 
only the above two edge cases that create an interoperability problem.
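
The two edge cases can be sketched in a few lines of Java. This uses the JDK's DataOutputStream.writeUTF ("modified UTF-8") as a stand-in for Lucene's own string writer; I'm assuming the two behave the same way for nulls and surrogate pairs, and I haven't verified that against the Lucene source here. The class and helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not Lucene code.
class ModifiedUtf8Demo {
    // Encode with writeUTF (Java "modified UTF-8"), then drop the
    // 2-byte length prefix that writeUTF puts in front of the data.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encode as conformant UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True when both encodings produce identical bytes.
    static boolean sameBytes(String s) {
        return Arrays.equals(modifiedUtf8(s), standardUtf8(s));
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("caf\u00e9"));    // true:  BMP text, no nulls
        System.out.println(sameBytes("a\0b"));         // false: null becomes C0 80
        System.out.println(sameBytes("\uD834\uDD1E")); // false: 6 bytes vs. 4 bytes
    }
}
```

So a reader that expects conformant UTF-8 would only trip over strings containing U+0000 or characters above U+FFFF; everything else is byte-for-byte identical.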

-- Ken
-- 
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
