lucene-dev mailing list archives

From: Erik Hatcher <e...@ehatchersolutions.com>
Subject: Re: Lucene does NOT use UTF-8
Date: Mon, 29 Aug 2005 08:30:05 GMT

On Aug 28, 2005, at 11:42 PM, Ken Krugler wrote:
>> I'm not familiar enough with UTF-8 to follow the details of this
>> discussion.  I hope other Lucene developers are, so we can resolve
>> this issue.... anyone raising a hand?
>>
>
> I could, but recent posts make me think this is heading towards a
> religious debate :)

Ken - you mentioned taking the discussion off-line in a previous  
post.  Please don't.  Let's keep it alive on java-dev until we have a  
resolution to it.

> I think the following statements are all true:
>
> a. Using UTF-8 for strings would make it easier for Lucene indexes  
> to be used by other implementations besides the reference Java  
> version.
>
> b. It would be easy to tweak Lucene to read/write conformant UTF-8  
> strings.

What performance impact, if any, would this change have on Java
Lucene?  (I realize this is rhetorical at this point, until a
solution is at hand.)
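
To make "conformant" in point b concrete, here is a minimal Java
sketch comparing standard UTF-8 with the "modified UTF-8" that the
JDK's DataOutputStream.writeUTF emits.  It assumes Lucene's on-disk
string encoding resembles the JDK's modified form, which this message
does not spell out.  The class and method names (Utf8Compare,
standardUtf8, modifiedUtf8, hex) are made up for the example and are
not Lucene APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
class Utf8Compare {
    // Conformant UTF-8 bytes, as produced by the JDK's standard charset.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // JDK "modified UTF-8" bytes from DataOutputStream.writeUTF,
    // with the 2-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Space-separated uppercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two cases where the encodings differ: the NUL character
        // and any character outside the Basic Multilingual Plane.
        String nul = String.valueOf((char) 0);
        String clef = new String(Character.toChars(0x1D11E));

        for (String s : new String[] { nul, clef }) {
            System.out.println("standard: " + hex(standardUtf8(s)));
            System.out.println("modified: " + hex(modifiedUtf8(s)));
        }
    }
}
```

The two forms agree for everything else, so a reader written in
another language only trips over NUL and supplementary characters.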

> c. The hard(er) part would be backwards compatibility with older  
> indexes. I haven't looked at this enough to really know, but one  
> example is the compound file (xx.cfs) format...I didn't see a  
> version number, and it contains strings.

I don't know the gory details, but we've made compatibility-breaking
changes in the past, and the current version of Lucene can open older
formats while writing only the most current one.  I suspect this
change could be made backwards compatible as well.  Worst case, we
break compatibility in 2.0.

> d. The documentation could be clearer on what is meant by the  
> "string length", but this is a trivial change.

That change was made by Daniel soon after this discussion began.
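
For anyone wondering why "string length" in point d needs spelling
out: a Java string has at least three plausible lengths, and they
diverge as soon as the text leaves ASCII.  The sketch below only
shows that the numbers differ; which of them the file-format
documentation specifies is not stated in this message.  The class
and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Lucene.
class StringLengths {
    // Number of Java chars (UTF-16 code units).
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "cafe" with an accented e, followed by a character outside
        // the Basic Multilingual Plane (U+1D11E).
        String s = "caf" + (char) 0xE9 + new String(Character.toChars(0x1D11E));

        System.out.println("UTF-16 units: " + utf16Units(s));
        System.out.println("code points:  " + codePoints(s));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(s));
    }
}
```

A reader that expects a byte count but is handed a char count (or the
reverse) will misparse the first non-ASCII string it meets.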

> What's unclear to me (not being a Perl, Python, etc. jock) is how
> much easier it would be to get these other implementations working  
> with Lucene, following a change to UTF-8. So I can't comment on the  
> return on time required to change things.
>
> I'm also curious about the existing CLucene & PyLucene ports. Would  
> they also need to be similarly modified, with the proposed changes?

PyLucene is literally the Java version of Lucene underneath (via
GCJ/SWIG), so no worries there.  CLucene would need to be changed, as
well as DotLucene and the other ports out there.

If the rest of the world of Lucene ports followed suit with PyLucene  
and did the GCJ/SWIG thing, we'd have no problems :)  What are the  
disadvantages to following this model with Plucene?

     Erik
