lucene-dev mailing list archives

From Yonik Seeley <ysee...@gmail.com>
Subject Re: Lucene and UTF-8
Date Wed, 21 Sep 2005 19:25:35 GMT
How does this patch work w.r.t. the length vint?

It looks like the length is still the number of 16-bit Java chars,
but the encoding is now correct UTF-8?
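
To make the question concrete, here is a minimal sketch of the three
candidate "lengths" for one string. The class and method names are
hypothetical and not part of the patch; it only uses standard JDK calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the patch: three ways to measure a string.
final class LengthUnits {
    // Number of 16-bit Java chars (UTF-16 code units).
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard UTF-8 encoding.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', e-acute, and U+1D11E written as a surrogate pair.
        String s = "a\u00E9\uD834\uDD1E";
        System.out.println(charCount(s) + " chars, "
                + codePointCount(s) + " code points, "
                + utf8ByteCount(s) + " UTF-8 bytes");
    }
}
```

For that sample string the three numbers differ (4 chars, 3 code points,
7 bytes), so a reader has to know which one the VInt prefix holds.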


-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 9/21/05, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> On Sep 20, 2005, at 11:53 PM, Chris Lamprecht wrote:
>
> > import java.util.Arrays;
> >
> > ...
> >
> > Arrays.equals(array1, array2);
>
> Great, thank you, Chris.
>
> The patch for IndexOutput.java is done. It will now write valid
> UTF-8. Older versions of Lucene will not be able to read indexes
> written using this class, as they will choke if they encounter a null
> byte or a 4-byte UTF-8 sequence.
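
The two incompatibilities can be illustrated with the JDK alone. This is
a sketch under an assumption: that the older on-disk encoding behaves
like Java's "modified UTF-8" as produced by DataOutputStream.writeUTF.
The class and method names are hypothetical and nothing here is taken
from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical comparison of standard UTF-8 and Java's modified UTF-8.
final class Utf8Variants {
    // Standard UTF-8 bytes for the string.
    static byte[] standard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 bytes, with writeUTF's 2-byte length prefix removed.
    static byte[] modified(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With these helpers, "\u0000" is the single byte 0x00 in standard UTF-8
but the pair C0 80 in modified UTF-8, and a supplementary character such
as "\uD834\uDD1E" is one 4-byte sequence in standard UTF-8 but two
3-byte sequences (one per surrogate) in modified UTF-8. A reader written
for the modified form never expects either standard byte pattern.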
>
> As an added bonus, this patch yields a speedup of a couple of percentage
> points (on my machine), made possible by simplified conditionals.
> For instance, the first if() clause...
>
> if (code >= 0x01 && code <= 0x7F)
>
> ...is now...
>
> if (code < 0x80)
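
For context, here is a sketch of how a standard UTF-8 encoder can branch
once the null byte no longer needs a special case. It is not the code
from IndexOutput.patch; the class name and the choice to encode one code
point at a time into a byte array are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical encoder sketch; input must be a valid code point
// (0..0x10FFFF) that is not a surrogate.
final class Utf8Encoder {
    static byte[] encode(int code) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (code < 0x80) {                 // 1 byte, U+0000 included
            out.write(code);
        } else if (code < 0x800) {         // 2 bytes
            out.write(0xC0 | (code >> 6));
            out.write(0x80 | (code & 0x3F));
        } else if (code < 0x10000) {       // 3 bytes
            out.write(0xE0 | (code >> 12));
            out.write(0x80 | ((code >> 6) & 0x3F));
            out.write(0x80 | (code & 0x3F));
        } else {                           // 4 bytes
            out.write(0xF0 | (code >> 18));
            out.write(0x80 | ((code >> 12) & 0x3F));
            out.write(0x80 | ((code >> 6) & 0x3F));
            out.write(0x80 | (code & 0x3F));
        }
        return out.toByteArray();
    }
}
```

Each branch is a single upper-bound comparison because the earlier
branches have already excluded the smaller ranges.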
>
> The new TestIndexOutput.java class is sort of done. It has all the
> tests Ken suggested, though I think it could stand the addition of a
> randomized test to excite edge cases. The data mirrors the data from
> TestIndexInput.java, and that's by design, as I think with so much
> overlap the two ought to be merged. How does "TestIndexIO.java" grab
> you all?
>
> On Aug 29, 2005, at 11:49 AM, Ken Krugler wrote:
>
> > a. Single surrogate pair (two Java chars)
> > b. Surrogate pair at the beginning, followed by regular data.
> > c. Surrogate pair at the end, preceded by regular data.
> > d. Two surrogate pairs in a row.
> >
> > Then all of the above, but remove the second (low-order) surrogate
> > character (busted format).
> >
> > Then all of the above, but replace the first (high-order) surrogate
> > character.
>
> A minor wrinkle: each unpaired surrogate will have to be replaced by
> the Unicode replacement character U+FFFD, or the VInt count will be
> off. This means that a UTF-16LE sequence will grow by a code point,
> as the (mis-ordered) surrogate pair, which was meant to represent a
> single code point, will get subbed out for two replacement
> characters. I don't think this is serious, though.
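
The replacement rule can be sketched as follows. This is an illustrative
helper with a hypothetical name, not the logic from the patch: it keeps
valid high/low pairs and swaps every other surrogate for U+FFFD, one
char for one char.

```java
// Hypothetical sanitizer: replaces each unpaired surrogate with U+FFFD.
final class SurrogateSanitizer {
    static String replaceUnpaired(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid pair: copy both chars and skip the low surrogate.
                sb.append(c).append(s.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(c)) {
                // Lone or mis-ordered surrogate: one replacement char.
                sb.append('\uFFFD');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

Because each replacement is one char for one char, the number of Java
chars is unchanged. A mis-ordered pair such as "\uDD1E\uD834" comes out
as two U+FFFD characters, which is the growth described above.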
>
> > Then all of the above, but replace the surrogate pair with an xC0
> > x80 encoded null byte.
>
> I left this out of the test cases for IndexOutput (it's in there, and
> important, for IndexInput). The UTF-16 sequence "\u00C0\u0080"
> doesn't map to a null, so I used the regular UTF-16 null "\u0000".
> As before, I think this is what you intended.
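
The distinction between the C0 80 byte pair and the two-char string can
be checked with the JDK. This sketch assumes the legacy input format
decodes like Java's modified UTF-8 (DataInputStream.readUTF); the class
and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of why the byte pair and the char pair differ.
final class NullEncodings {
    // Decode a modified UTF-8 payload by adding readUTF's length prefix.
    static String decodeModified(byte[] payload) {
        byte[] framed = new byte[payload.length + 2];
        framed[0] = (byte) (payload.length >> 8);
        framed[1] = (byte) payload.length;
        System.arraycopy(payload, 0, framed, 2, payload.length);
        try {
            return new DataInputStream(new ByteArrayInputStream(framed)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Standard UTF-8 bytes for the string.
    static byte[] standard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Decoding the bytes C0 80 this way yields the single char U+0000, which
makes sense as an input test. Going the other way, the string
"\u00C0\u0080" is two ordinary chars and encodes to the four bytes
C3 80 C2 80, so for an output test "\u0000" is the input that exercises
the null case.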
>
> Files and patches can be found here:
>
> http://www.rectangular.com/downloads/IndexOutput.patch
> http://www.rectangular.com/downloads/MockIndexOutput.java
> http://www.rectangular.com/downloads/TestIndexOutput.java
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
