lucene-solr-dev mailing list archives

From Yonik Seeley
Subject Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/
Date Wed, 09 Sep 2009 01:10:31 GMT
On Tue, Sep 8, 2009 at 7:46 PM, Chris Hostetter wrote:
> The modifiedUTF8 boolean only influences the numeric length returned as the
> "s" option ... the actual "val" string is still written "as is" by the
> servlet container.


A code point (Unicode character) outside of the BMP (Basic
Multilingual Plane, the code points that fit in 16 bits) is represented
as two Java chars - a surrogate pair.  It's a single logical character -
see String.codePointAt().  In correct UTF-8 it should be encoded as a
single code point (one 4-byte sequence)... but Jetty is ignoring the
fact that it's a surrogate pair and encoding each Java char as its own
code point (two 3-byte sequences)... this is often called modified UTF-8
or CESU-8.
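
To make the difference concrete, here is a minimal Java sketch. The class name SurrogateDemo and the helper cesu8() are hypothetical, written only for illustration; this is not Jetty's or Solr's code. It encodes each Java char as if it were its own code point, which is the behavior described above, and compares the result with the JDK's standard UTF-8 encoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Hypothetical helper: encode every 16-bit Java char on its own,
    // ignoring surrogate pairs (CESU-8 style output).
    static byte[] cesu8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // Surrogate halves (0xD800-0xDFFF) also land here: 3 bytes each.
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+10400 lies outside the BMP, so it needs a surrogate pair in UTF-16.
        String s = new String(Character.toChars(0x10400));
        System.out.println(s.length());                                // 2 Java chars
        System.out.println(s.codePointCount(0, s.length()));           // 1 code point
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes
        System.out.println(cesu8(s).length);                           // 6 bytes
    }
}
```

The two byte lengths differ (4 vs 6), which is why a length prefix computed one way does not match a string written the other way.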

So... say you have this incorrect CESU-8 that is masquerading as
UTF-8: all is not lost.
- A decoder can unambiguously recognize that the characters decoded
actually form a surrogate pair and correctly decode - but I don't know
if there are any requirements to do so (doubt it), and I don't know
which do so.
- A decoder decoding into UTF-16 (like Java Strings) will correctly
decode anyway, even if treating each code point in the pair as

PHP5 doesn't even do unicode, so it won't care.
PHP6 apparently has unicode support - don't know much about it.

Bottom line - if we correctly encapsulate whatever the servlet
container is writing, it's certainly possible for clients to use the
output correctly.
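
As a rough illustration of the UTF-16 decoder case mentioned above, here is a sketch of a client-side decoder. The class Cesu8Decode and the method decodeCesu8() are hypothetical names, and the code does no validation of malformed or truncated input. It decodes each 1-, 2- or 3-byte sequence into one Java char; the two surrogate halves end up next to each other in the String, so they form a valid pair again without any special handling.

```java
class Cesu8Decode {
    // Hypothetical minimal decoder: one Java char per encoded sequence.
    // Assumes well-formed input; 4-byte sequences are rejected.
    static String decodeCesu8(byte[] b) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < b.length) {
            int b0 = b[i] & 0xFF;
            if (b0 < 0x80) {
                sb.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0) {
                sb.append((char) (((b0 & 0x1F) << 6) | (b[i + 1] & 0x3F)));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0) {
                sb.append((char) (((b0 & 0x0F) << 12)
                        | ((b[i + 1] & 0x3F) << 6)
                        | (b[i + 2] & 0x3F)));
                i += 3;
            } else {
                throw new IllegalArgumentException("unexpected lead byte at " + i);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10400 written as two 3-byte sequences (high surrogate, low surrogate).
        byte[] bytes = {(byte) 0xED, (byte) 0xA0, (byte) 0x81,
                        (byte) 0xED, (byte) 0xB0, (byte) 0x80};
        String s = decodeCesu8(bytes);
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 10400
    }
}
```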

> you've also changed the behavior so that even when
> false==modifiedUTF8, the length is now computed differently than before
> the patch (using UnicodeUtil.UTF16toUTF8) ... it's this second change i
> don't understand based on the context of the bug: why does the length
> value need to be computed differently for non-jetty implementations?

It will be the same length for non-jetty implementations - I just
rewrote the entire method and used the Lucene UTF16toUTF8 for
performance reasons.  (bad developer, bad!)
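
For illustration only, the length calculation in the two modes could look something like the sketch below. Utf8Length and utf8Length() are hypothetical names; this is not the committed Solr code and it does not call UnicodeUtil.UTF16toUTF8. It also assumes an unpaired surrogate is counted as 3 bytes, which may not match what a given container actually writes.

```java
class Utf8Length {
    // Hypothetical helper: count the bytes a string would occupy without
    // allocating a byte array. With modifiedUTF8 == true every Java char is
    // counted on its own, so a surrogate pair costs 3 + 3 bytes instead of 4.
    static int utf8Length(CharSequence s, boolean modifiedUTF8) {
        int len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                len += 1;
            } else if (c < 0x800) {
                len += 2;
            } else if (!modifiedUTF8
                    && Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                len += 4; // one supplementary code point in standard UTF-8
                i++;      // skip the low surrogate, already counted
            } else {
                len += 3; // BMP char, or (by assumption) an unpaired surrogate
            }
        }
        return len;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x10400));
        System.out.println(utf8Length(s, false)); // 5
        System.out.println(utf8Length(s, true));  // 7
    }
}
```

For strings that contain no surrogate pairs the two modes give the same number, so the flag only matters for characters outside the BMP.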

