lucene-java-user mailing list archives

From Ken Krugler <>
Subject Re: Lucene does NOT use UTF-8.
Date Sat, 27 Aug 2005 21:11:34 GMT
>I've delved into the matter of Lucene and UTF-8 a little further, 
>and I am discouraged by what I believe I've uncovered.
>Lucene should not be advertising that it uses "standard UTF-8" -- or 
>even UTF-8 at all, since "Modified UTF-8" is _illegal_ UTF-8.

Unfortunately this is how Sun documents the format they use for 
serialized strings.
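To make the difference concrete, here is a rough sketch (in Python, purely for illustration) of how Java's `DataOutputStream.writeUTF` encoding diverges from standard UTF-8 -- the NUL and supplementary-plane branches are the two cases at issue:

```python
def java_modified_utf8(s: str) -> bytes:
    """Sketch of Java's "modified UTF-8" (as used by writeUTF):
    NUL becomes the overlong pair 0xC0 0x80, and supplementary
    characters are written as two 3-byte encoded UTF-16 surrogates."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0x0000:
            out += b"\xc0\x80"  # overlong NUL -- illegal in real UTF-8
        elif cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            # split into a UTF-16 surrogate pair, encode each half as 3 bytes
            cp -= 0x10000
            hi = 0xD800 | (cp >> 10)
            lo = 0xDC00 | (cp & 0x3FF)
            for sur in (hi, lo):
                out += bytes([0xE0 | (sur >> 12),
                              0x80 | ((sur >> 6) & 0x3F),
                              0x80 | (sur & 0x3F)])
    return bytes(out)

print(java_modified_utf8("A\x00").hex())       # NUL written as c0 80
print(java_modified_utf8("\U0001D11E").hex())  # one codepoint, six bytes
```

For BMP characters other than NUL the output is byte-for-byte identical to real UTF-8; only those two branches produce ill-formed sequences.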

>The two distinguishing characteristics of "Modified UTF-8" are the 
>treatment of codepoints above the BMP (which are written as 
>surrogate pairs), and the encoding of null bytes as 1100 0000 1000 
>0000 rather than 0000 0000.  Both of these became illegal as of 
>Unicode 3.1 (IIRC), because they are not shortest-form and 
>non-shortest-form UTF-8 presents a security risk.

For UTF-8 these were always invalid, but the standard wasn't very 
clear about it. Unfortunately the fuzzy nature of the 1.0/2.0 specs 
encouraged some sloppy implementations.
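Any strict decoder written to the current standard rejects both forms. A quick check (my illustration, not from the thread) using Python's strict UTF-8 decoder:

```python
# Both "modified UTF-8" byte sequences are ill-formed UTF-8.
overlong_nul = b"\xc0\x80"                   # NUL as a 2-byte overlong sequence
cesu8_pair   = b"\xed\xa0\xb4\xed\xb4\x9e"   # U+1D11E as encoded UTF-16 surrogates

for name, data in (("overlong NUL", overlong_nul),
                   ("surrogate pair", cesu8_pair)):
    try:
        data.decode("utf-8")
        print(name, "accepted")
    except UnicodeDecodeError as e:
        print(name, "rejected:", e.reason)
```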

>The documentation should really state that Lucene stores strings in 
>a Java-only adulteration of UTF-8,

Yes, good point. I don't know who's in charge of that page, but it 
should be fixed.

>unsuitable for interchange.

Other than as an internal representation for Java serialization.

>Since Perl uses true shortest-form UTF-8 as its native encoding, 
>Plucene would have to jump through two efficiency-killing hoops in 
>order to write files that would not choke Lucene: instead of writing 
>out its true, legal UTF-8 directly, it would be necessary to first 
>translate to UTF-16, then duplicate the Lucene encoding algorithm 
>from OutputStream.  In theory.

Actually I don't think it would be all that bad. Since a null in the 
middle of a string is rare, as is a character outside of the BMP, a 
quick scan of the text should be sufficient to determine if it can be 
written as-is.
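The scan I have in mind would be something like this (a Python sketch; `needs_reencoding` is my name for it, not anything in Lucene or Plucene):

```python
def needs_reencoding(s: str) -> bool:
    """Return True if the string contains a NUL or a character outside
    the BMP -- the only cases where standard UTF-8 and Java's modified
    UTF-8 disagree. Anything else can be written through as-is."""
    return any(ch == "\x00" or ord(ch) > 0xFFFF for ch in s)

print(needs_reencoding("plain ascii"))      # False: write as-is
print(needs_reencoding("null\x00byte"))     # True
print(needs_reencoding("clef \U0001D11E"))  # True: outside the BMP
```

In the common case the scan returns False and the UTF-8 bytes go out untouched; only the rare strings pay the re-encoding cost.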

The ICU project has C code that can be used to quickly walk a string. 
I believe those routines would find and report such invalid code 
points, if you use the safe (versus faster, unsafe) versions.

>Below you will find a simple Perl script which illustrates what 
>happens when Perl encounters malformed UTF-8.  Run it (you need Perl 
>5.8 or higher) and you will see why even if I thought it was a good 
>idea to emulate the Java hack for encoding "Modified UTF-8", trying 
>to make it work in practice would be a nightmare.
>If Plucene were to write legal UTF-8 strings to its index files, 
>Java Lucene would misbehave and possibly blow up any time a string 
>contained either a 4-byte character or a null byte.  On the flip 
>side, Perl will spew warnings like crazy and possibly blow up 
>whenever it encounters a Lucene-encoded null or surrogate pair.  The 
>potential blowups are due to the fact that Lucene and Plucene will 
>not agree on how many characters a string contains, resulting in 
>overruns or underruns.
>I am hoping that the answer to this will be a fix to the encoding 
>mechanism in Lucene so that it really does use legal UTF-8.  The 
>most efficient way to go about this has not yet presented itself.

I'd need to look at the code more, but using something other than the 
Java serialized format would probably incur a performance penalty for 
the Java implementation. Or at least make it harder to handle the 
strings using the standard Java serialization support. So I doubt 
this would be a slam-dunk in the Lucene community.
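On the overrun/underrun point above, the disagreement is easy to demonstrate (my illustration): for a supplementary character, Perl's length() counts codepoints while Java's String.length() counts UTF-16 code units, so the two sides read different character counts from the same string.

```python
s = "G clef: \U0001D11E"                   # 8 ASCII chars + one non-BMP char
codepoints  = len(s)                        # what Perl's length() would report
utf16_units = len(s.encode("utf-16-be")) // 2  # what Java's String.length() reports
print(codepoints, utf16_units)              # → 9 10: the clef is a surrogate pair
```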

-- Ken

>use strict;
>use warnings;
># illegal_null.plx -- Perl complains about non-shortest-form null.
>my $data = "foo\xC0\x80\n";
>open (my $virtual_filehandle, "+<:utf8", \$data);
>print <$virtual_filehandle>;

Ken Krugler
TransPac Software, Inc.
+1 530-470-9200

