lucene-dev mailing list archives

From Ken Krugler
Subject Re: Lucene and UTF-8
Date Mon, 29 Aug 2005 18:49:00 GMT
Hi Marvin,

>I'm guessing that since I'm the one that cares most about 
>interoperability, I'll have to volunteer to do the heavy lifting.
>Tomorrow I'll go through and survey how many and which things would 
>need to change to achieve full UTF-8 compliance.  One concern is 
>that I think in order to make that last case work, readChars() may 
>need to return an array.  Since readChars() is part of the public 
>API and may be called by something other than readString(), I don't 
>know if that'll fly.

I don't believe such a change would be required, since the ultimate 
data source/destination on the Java side will look the same (array of 
Java chars) - the only issue is how it looks when serialized.
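To make the "same chars, different bytes" point concrete, here is a minimal sketch. It uses Java's DataOutputStream.writeUTF as a stand-in for a serializer that writes each Java char separately; it is not Lucene's own writeChars(), and the class and method names are hypothetical. The in-memory string is identical in both cases, and only the byte form differs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of Lucene.
public class SerializedForms {

    // Standard UTF-8: a surrogate pair is encoded as one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's "modified UTF-8" (DataOutputStream.writeUTF): each surrogate
    // char is encoded on its own as 3 bytes, and U+0000 becomes 0xC0 0x80.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(s);
        out.flush();
        byte[] withLength = buf.toByteArray();
        // writeUTF prepends a 2-byte length; drop it to compare payloads.
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) throws IOException {
        // One non-BMP code point (U+10400), held as two Java chars.
        String s = new String(Character.toChars(0x10400));
        System.out.println("Java chars:           " + s.length());
        System.out.println("standard UTF-8 bytes: " + standardUtf8(s).length);
        System.out.println("modified UTF-8 bytes: " + modifiedUtf8(s).length);
    }
}
```

Either byte form decodes back to the same two-char array, which is why the signature of readChars() would not need to change.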

>It seems clear that you have sufficient expertise to hone my rough 
>contributions into final form.  If you have the interest, would that 
>be a good division of labor?  I wish I could do this alone and just 
>supply finished, tested patches, but obviously I can't.  Or perhaps 
>I'm underestimating your level of interest -- do you want to take 
>the ball and run with it?

I can take a look at the code, sure. The hard part will be coding up 
the JUnit test cases (see below).

>I think we could stand to have 2 corpora of test documents 
>available: one which is predominantly 2-byte and 3-byte UTF-8 (but 
>no 4-byte), and another which has the full range including non-BMP 
>code points.  I can hunt those down or maybe get somebody from the 
>Plucene community to create them, but perhaps they already exist?

Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but remove the second (low-order) surrogate 
character (busted format).

Then all of the above, but replace the first (high-order) surrogate character.

Then all of the above, but replace the surrogate pair with a null 
character, which is serialized as the two bytes 0xC0 0x80.

And no, I don't think this test data exists, unfortunately. But it 
shouldn't be too hard to generate.
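As a rough sketch of how the strings could be generated (assuming Java 5+ code point APIs; the class name, method names and the choice of U+10400 and U+1D11E are mine, purely for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical generator for the decoder test strings listed above.
public class SurrogateTestData {

    // Two arbitrary non-BMP code points, each stored as a surrogate pair.
    static final String PAIR = new String(Character.toChars(0x10400));
    static final String PAIR2 = new String(Character.toChars(0x1D11E));

    // Well-formed cases a through d.
    static Map<String, String> wellFormed() {
        Map<String, String> cases = new LinkedHashMap<>();
        cases.put("a", PAIR);          // single surrogate pair
        cases.put("b", PAIR + "abc");  // pair at the beginning
        cases.put("c", "abc" + PAIR);  // pair at the end
        cases.put("d", PAIR + PAIR2);  // two pairs in a row
        return cases;
    }

    // Busted variant: remove every low-order surrogate, leaving
    // unpaired high-order surrogates behind.
    static String dropLowSurrogates(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLowSurrogate(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Variant: replace each complete surrogate pair with U+0000, which a
    // modified-UTF-8 style writer serializes as 0xC0 0x80.
    static String replacePairsWithNull(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append((char) 0);
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

The high-order surrogate variant is left out of the sketch because it depends on which replacement character is chosen. A JUnit test could loop over wellFormed(), apply each transformation, and round-trip the result through the writer and reader.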

-- Ken
Ken Krugler
TransPac Software, Inc.
+1 530-470-9200
