lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Lucene and UTF-8
Date Wed, 21 Sep 2005 02:26:41 GMT
Hello again,

I've prepared a patch for IndexInput.java, and an accompanying patch  
for TestIndexInput.java.  I figured I would submit them for  
discussion here before filing them via Jira.  The patches are  
attached to this email; if I find that they get stripped by the  
listserv, I'll post them on a website.

The patch to IndexInput.java makes it capable of decoding both  
modified UTF-8 and valid UTF-8, so backwards compatibility is  
preserved.  I'll have another patch for IndexOutput.java soon, but  
IndexInput.java doesn't have to wait for it.

A crude benchmarking app I already have set up (it just builds an
index with 1000 docs) seems to support my expectation: this change to
IndexInput should have little or no impact on speed with Western,
mostly-ASCII text.  It might actually be a smidgen faster with text
which is mostly multi-byte UTF-8, since an if-else-if chain with
calculations inside the conditionals has been replaced by a switch
driven by a lookup table.  The only real cost of this patch is the
memory hit for loading the 248-byte lookup table.
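
For readers without the attachment, here is a rough sketch of the general
technique (a lookup table indexed by lead byte, then a switch on the
sequence length).  It is not the attached patch: the class name, the table
layout, and the decode() method are all assumptions made for illustration.
In particular, the 248-entry layout below (lead bytes 0x00 through 0xF7) is
only a guess at how a table of that size could be arranged.

```java
// Illustrative sketch only; not the code from the attached patch.
final class Utf8DecodeSketch {
    // Assumed layout: one entry per possible lead byte 0x00..0xF7 giving the
    // sequence length. 0 marks a byte that cannot start a sequence.
    private static final byte[] SEQ_LEN = new byte[248];
    static {
        for (int b = 0x00; b <= 0x7F; b++) SEQ_LEN[b] = 1; // ASCII
        for (int b = 0xC0; b <= 0xDF; b++) SEQ_LEN[b] = 2; // includes 0xC0 0x80
        for (int b = 0xE0; b <= 0xEF; b++) SEQ_LEN[b] = 3; // BMP, lone surrogates
        for (int b = 0xF0; b <= 0xF7; b++) SEQ_LEN[b] = 4; // non-BMP code points
    }

    // Decodes standard UTF-8 and modified UTF-8 into a Java (UTF-16) string.
    // Like the reader described in this message, it does not validate
    // continuation bytes or guard against truncated input.
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            int len = b < SEQ_LEN.length ? SEQ_LEN[b] : 0;
            switch (len) {
                case 1:
                    out.append((char) b);
                    break;
                case 2:
                    out.append((char) (((b & 0x1F) << 6) | (in[i + 1] & 0x3F)));
                    break;
                case 3:
                    // Modified UTF-8 stores each surrogate as its own 3-byte
                    // sequence, so a pair falls out of two passes through here.
                    out.append((char) (((b & 0x0F) << 12)
                            | ((in[i + 1] & 0x3F) << 6)
                            | (in[i + 2] & 0x3F)));
                    break;
                case 4: {
                    int cp = ((b & 0x07) << 18)
                            | ((in[i + 1] & 0x3F) << 12)
                            | ((in[i + 2] & 0x3F) << 6)
                            | (in[i + 3] & 0x3F);
                    out.appendCodePoint(cp); // emits the surrogate pair
                    break;
                }
                default:
                    throw new IllegalArgumentException(
                            "invalid lead byte 0x" + Integer.toHexString(b));
            }
            i += len;
        }
        return out.toString();
    }
}
```

With a decoder shaped like this, the 4-byte form F0 9D 84 9E and the 6-byte
modified form ED A0 B4 ED B4 9E both yield the same two Java chars, which
is what allows old and new index data to be read by the same code.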

My local copy of trunk revision 590297 passes all tests with these  
patches, except for TestIndexModifier which fails regardless.

Ken Krugler wrote...

> Good test data for the decoder would be the following:
>
> a. Single surrogate pair (two Java chars)
> b. Surrogate pair at the beginning, followed by regular data.
> c. Surrogate pair at the end, preceded by regular data.
> d. Two surrogate pairs in a row.

I've selected U+1D11E "MUSICAL SYMBOL G CLEF" and U+1D160 "MUSICAL  
SYMBOL EIGHTH NOTE" as the non-BMP code points of choice.

http://www.fileformat.info/info/unicode/char/01d11e/index.htm
http://www.fileformat.info/info/unicode/char/01d160/index.htm

It might be my quadranoia acting up again, but it seemed like a good  
idea to add another test case, since UTF-8 is a stateful encoding  
(within a short span):

e. A string with two embedded surrogate pairs.

"Lu\uD834\uDD1Ece\uD834\uDD60ne"
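
To show why this string exercises both encodings, the sketch below encodes
it as standard UTF-8 and as modified UTF-8 and compares the lengths.  It
uses java.io.DataOutputStream.writeUTF as a stand-in for modified UTF-8;
that is an assumption for illustration and is not Lucene's IndexOutput.
The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical.
final class SurrogateEncodingSketch {
    // Test case (e): two embedded surrogate pairs, U+1D11E and U+1D160.
    static final String SAMPLE = "Lu\uD834\uDD1Ece\uD834\uDD60ne";

    // Standard UTF-8: each surrogate pair becomes one 4-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as produced by writeUTF: each surrogate is encoded
    // separately as a 3-byte sequence, so a pair occupies 6 bytes.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // writeUTF prefixes the data with a 2-byte length; drop it.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        System.out.println(standardUtf8(SAMPLE).length); // 6 ASCII + 2 * 4 = 14
        System.out.println(modifiedUtf8(SAMPLE).length); // 6 ASCII + 2 * 6 = 18
    }
}
```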

> Then all of the above, but remove the second (low-order) surrogate  
> character (busted format).
>
> Then all of the above, but replace the first (high-order) surrogate  
> character.

These are interesting.  Lucene isn't equipped to detect or correct
invalid Unicode when reading its own index files, and implementing
such capabilities would impose a performance penalty.  The assumption
is that Lucene will always read its own files and that those files
will never contain corrupt data.  Debatable, but it doesn't seem to
have caused problems up till now.

Since there's no way to check whether IndexInput catches invalid
input, I've skipped these two cases -- but I'll put them in my upcoming
IndexOutput patches, which I think is what you intended anyway.

> Then all of the above, but replace the surrogate pair with an xC0  
> x80 encoded null byte.

Done.

Three more test batches seemed appropriate.

Cases for the \x00 null, which would previously have been interpreted  
incorrectly as the start of a 3-byte UTF-8 sequence.

Cases for two-byte UTF-8, using U+00BF "INVERTED QUESTION MARK".
http://www.fileformat.info/info/unicode/char/00bf/index.htm

Cases for three-byte UTF-8, using U+2620 "SKULL AND CROSSBONES".
http://www.fileformat.info/info/unicode/char/2620/index.htm
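
For reference, the standard UTF-8 byte sequences behind these cases can be
printed with a small helper like the one below.  The class and method names
are hypothetical, and this is not part of the TestIndexInput patch; it only
shows the bytes a reader would be fed for each width.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; names are hypothetical.
final class Utf8WidthSketch {
    // Returns the standard UTF-8 encoding of s as space-separated hex bytes.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("\0"));           // 00          (one byte)
        System.out.println(hex("\u00BF"));       // C2 BF       (two bytes)
        System.out.println(hex("\u2620"));       // E2 98 A0    (three bytes)
        System.out.println(hex("\uD834\uDD1E")); // F0 9D 84 9E (four bytes)
    }
}
```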

Previously, there was only a test for the string "Lucene".

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

