lucene-dev mailing list archives

From Marvin Humphrey
Subject Re: Lucene does NOT use UTF-8.
Date Mon, 29 Aug 2005 05:39:57 GMT
Hello, Robert...

On Aug 28, 2005, at 7:50 PM, Robert Engels wrote:

> Sorry, but I think you are barking up the wrong tree... and your tone is
> quite bizarre. My personal OPINION is that your "script" language is an
> abomination, and anyone that develops in it is clearly hurting the
> advancement of all software - but that is another story, and doesn't matter
> much to the discussion - in a similar fashion your choice of words is
> clearly not going to help matters.

My personal perspective is a utilitarian one: languages, platforms,  
they all come and go eventually, and in between a lot of stuff gets  
done.  I enjoy and appreciate Java (what I know of it), and I watched  
the Ruby/Java spat a little while ago with dismay.  The enmity is not  
returned.  :)

> It may be less efficient to decode in other languages, but I don't think the
> original Lucene designers were too worried about the efficiencies of other
> languages/platforms.

That may be the case.  I suppose we're about to find out how  
important the Lucene development community considers interchange.   
The phrase "standard UTF-8" in the documentation led me to believe  
that the intention was to deploy honest-to-goodness UTF-8.  In fact,  
as was pointed out, the early versions of the Unicode standard were  
not very clear.  Lucene was begun in 1998, and Unicode
Corrigendum #1: "UTF-8 Shortest Form" wasn't released until 2001.  My
best guess is that Lucene's encoding was supposed to be legal UTF-8
and that the non-conformance is unintentional.
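
To illustrate the gap between "standard UTF-8" and a near-miss
encoding, here is a minimal Java sketch comparing the JDK's standard
UTF-8 encoder with the "modified UTF-8" written by
DataOutputStream.writeUTF().  The JDK's modified form is only a
stand-in: that Lucene's own writer produces exactly these bytes is an
assumption, not something established in this message.  The class
name Utf8Forms and its helper methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class Utf8Forms {

    /** Standard (shortest-form) UTF-8 from the JDK charset encoder. */
    static byte[] standard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Java "modified UTF-8" as written by DataOutputStream.writeUTF(),
     * with the 2-byte length prefix stripped off.
     */
    static byte[] modified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Space-separated uppercase hex, e.g. "C0 80". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\0";                                     // U+0000
        String clef = new String(Character.toChars(0x1D11E));  // U+1D11E

        // U+0000: standard is "00", modified is the two-byte "C0 80".
        System.out.println(hex(standard(nul)) + " vs " + hex(modified(nul)));

        // U+1D11E: standard is "F0 9D 84 9E"; modified encodes each
        // UTF-16 surrogate separately: "ED A0 B4 ED B4 9E".
        System.out.println(hex(standard(clef)) + " vs " + hex(modified(clef)));
    }
}
```

A strict UTF-8 decoder in another language will reject both modified
sequences, which is the interchange problem under discussion.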

Otis Gospodnetic raised objections when the Plucene project made the  
decision to abandon index compatibility with Java Lucene.  I've been  
arguing that that decision ought to be reconsidered.  It will make it  
easier to achieve this shared goal of interoperability if Plucene  
does not have to go out of its way to defeat measures painstakingly  
put in place by the Perl5Porters team to ensure secure and robust  
Unicode support.

One of the reasons I have placed my own search engine project on hold  
was that I concluded I could not improve in a meaningful way on  
Lucene's file format.  It's really a marvelous piece of work.   
Perhaps it will become the TIFF of inverted index formats.  It seems  
to me that the Lucene project would benefit from having it widely  
adopted.  I'd like to help with that.

> Using String.getBytes("UTF-8"), and new String(byte[],"UTF-8") is all
> that is needed.

Thank you for the tip.  At first blush, I'm concerned that those may
be difficult to make work with InputStream's readByte() without
incurring a performance penalty.  But if I'm wrong and it's six of
one, half a dozen of the other for Java Lucene, then if a change is
going to be made, I'll argue for that one.  That would harmonize with
the way binary field data is stored, assuming that I can trust that
portion of the spec document. ;)
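
As a rough sketch of what that suggestion could look like, the Java
below writes a string as a byte count followed by standard UTF-8
bytes, then reads it back in one bulk read instead of byte by byte.
It is not Lucene's code: the class Utf8StringIO and its methods are
hypothetical, java.io streams stand in for Lucene's own stream
classes, and a fixed 4-byte length is used for brevity where Lucene's
format uses a variable-length integer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8StringIO {

    /** Writes the UTF-8 byte count, then the UTF-8 bytes themselves. */
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);   // assumption: fixed-width length prefix
        out.write(utf8);
    }

    /** Reads the byte count, bulk-reads that many bytes, and decodes them. */
    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] utf8 = new byte[length];
        in.readFully(utf8);          // one bulk read, no per-byte decoding loop
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Serializes a string to an in-memory buffer. */
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeString(out, s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Encodes and then decodes a string, returning the decoded copy. */
    static String roundTrip(String s) {
        byte[] data = encode(s);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return readString(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String term = "na\u00EFve " + new String(Character.toChars(0x1D11E));
        System.out.println(term.equals(roundTrip(term)));
    }
}
```

Because the prefix counts bytes rather than characters, a reader can
skip or copy a string without decoding it, the same way it would
handle a length-prefixed binary field.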


Marvin Humphrey
Rectangular Research
