lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <>
Subject [jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Wed, 19 Mar 2008 22:40:24 GMT


Michael McCandless commented on LUCENE-510:

> I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?

I think there are two reasons for rolling our own instead of using
CharsetEncoder/Decoder.  First is performance.  If I use
CharsetEncoder, like this:

  CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
  CharBuffer cb = CharBuffer.allocate(5000);
  ByteBuffer bb = ByteBuffer.allocate(5000);
  byte[] bbArray = bb.array();
  UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

  t0 = System.currentTimeMillis();
  for(int i=0;i<count;i++) {
    // reset the encoder and both buffers, then load the next term
    encoder.reset();
    bb.clear();
    cb.clear();
    cb.put(strings[i]);
    cb.flip();
    encoder.encode(cb, bb, true);
    encoder.flush(bb);
  }

Then it takes 676 msec to convert the ~3.3 million term strings
produced by indexing the first 200K Wikipedia docs.  If I replace the
loop body with:

  UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result);

It takes 441 msec, roughly 35% less time than the CharsetEncoder loop.
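To make the comparison concrete, here is a minimal, self-contained sketch of the two approaches. It does not use Lucene: `encodeByHand` is a hypothetical stand-in for `UnicodeUtil.UTF16toUTF8`, the class name and sample strings are made up for illustration, and the timings it prints will depend on the JVM and the data, so it is not a reproduction of the numbers above.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class Utf8EncodeSketch {

  // Hand-rolled UTF-16 to UTF-8 conversion (illustrative; not Lucene's UnicodeUtil).
  // Returns the number of bytes written; out needs room for 3 bytes per input char.
  static int encodeByHand(String s, byte[] out) {
    int upto = 0;
    final int len = s.length();
    for (int i = 0; i < len; i++) {
      int c = s.charAt(i);
      if (c < 0x80) {
        out[upto++] = (byte) c;
      } else if (c < 0x800) {
        out[upto++] = (byte) (0xC0 | (c >> 6));
        out[upto++] = (byte) (0x80 | (c & 0x3F));
      } else if (Character.isHighSurrogate((char) c) && i + 1 < len
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        // valid surrogate pair: one 4-byte sequence
        int cp = Character.toCodePoint((char) c, s.charAt(++i));
        out[upto++] = (byte) (0xF0 | (cp >> 18));
        out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (cp & 0x3F));
      } else {
        // assumption of this sketch: unpaired surrogates become U+FFFD
        if (Character.isSurrogate((char) c)) c = 0xFFFD;
        out[upto++] = (byte) (0xE0 | (c >> 12));
        out[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (c & 0x3F));
      }
    }
    return upto;
  }

  // The same conversion through java.nio, reusing the encoder and both buffers.
  static int encodeWithCharset(CharsetEncoder encoder, String s, CharBuffer cb, ByteBuffer bb) {
    encoder.reset();
    cb.clear();
    bb.clear();
    cb.put(s);
    cb.flip();
    encoder.encode(cb, bb, true);
    encoder.flush(bb);
    return bb.position();
  }

  public static void main(String[] args) {
    String[] strings = {"lucene", "na\u00EFve", "\u20AC100", "\uD83D\uDE00"};
    int rounds = 1_000_000;
    CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
    CharBuffer cb = CharBuffer.allocate(5000);
    ByteBuffer bb = ByteBuffer.allocate(5000);
    byte[] out = new byte[15000];
    long total = 0; // accumulated so the JIT cannot drop the work

    long t0 = System.nanoTime();
    for (int r = 0; r < rounds; r++)
      for (String s : strings) total += encodeWithCharset(encoder, s, cb, bb);
    long t1 = System.nanoTime();
    for (int r = 0; r < rounds; r++)
      for (String s : strings) total += encodeByHand(s, out);
    long t2 = System.nanoTime();

    System.out.println("CharsetEncoder: " + (t1 - t0) / 1_000_000 + " msec");
    System.out.println("hand-rolled:    " + (t2 - t1) / 1_000_000 + " msec");
    System.out.println("bytes encoded:  " + total);
  }
}
```

A single un-warmed run like this is only a rough indication; a proper measurement would warm up the JIT first and use real term data.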

The second reason is an API mismatch.  E.g., we need to convert a char[]
that ends in the 0xffff character.  Also, we need to do incremental
conversion (only convert the bytes that changed), which is used by
TermEnum.  CharsetEncoder/Decoder don't support either.
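The two requirements can be sketched as follows. This is an illustration under stated assumptions, not the code in the patch: the class, `encodeUntilSentinel`, `decodeFrom`, and `Utf16State` are hypothetical names, the input is assumed to be valid UTF-8, and `changedFrom` is assumed to fall on the first byte of a character. The idea is that when the next term shares a prefix with the previous one, only the bytes after the shared prefix are decoded again, using a per-byte table of char positions kept from the previous call.

```java
import java.util.Arrays;

final class IncrementalUtf8Sketch {
  static final char END = 0xFFFF; // sentinel; U+FFFF is a noncharacter, so it cannot occur in valid text

  // Encodes source[offset..] up to (not including) the END sentinel. Returns bytes written.
  static int encodeUntilSentinel(char[] source, int offset, byte[] out) {
    int upto = 0;
    int i = offset;
    while (source[i] != END) {
      int c = source[i++];
      if (c < 0x80) {
        out[upto++] = (byte) c;
      } else if (c < 0x800) {
        out[upto++] = (byte) (0xC0 | (c >> 6));
        out[upto++] = (byte) (0x80 | (c & 0x3F));
      } else if (Character.isHighSurrogate((char) c) && Character.isLowSurrogate(source[i])) {
        int cp = Character.toCodePoint((char) c, source[i++]);
        out[upto++] = (byte) (0xF0 | (cp >> 18));
        out[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (cp & 0x3F));
      } else {
        if (Character.isSurrogate((char) c)) c = 0xFFFD; // sketch assumption for unpaired surrogates
        out[upto++] = (byte) (0xE0 | (c >> 12));
        out[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
        out[upto++] = (byte) (0x80 | (c & 0x3F));
      }
    }
    return upto;
  }

  // Decoded chars plus, for each byte that starts a character, the index of its first char.
  static final class Utf16State {
    char[] chars = new char[16];
    int[] charStartAtByte = new int[17];
    int length;
  }

  // Decodes utf8[changedFrom..len); everything before changedFrom is reused from the previous call.
  static void decodeFrom(byte[] utf8, int changedFrom, int len, Utf16State st) {
    if (st.chars.length < len) st.chars = Arrays.copyOf(st.chars, len);
    if (st.charStartAtByte.length < len + 1) st.charStartAtByte = Arrays.copyOf(st.charStartAtByte, len + 1);
    int out = st.charStartAtByte[changedFrom]; // chars before this index are kept as-is
    int i = changedFrom;
    while (i < len) {
      st.charStartAtByte[i] = out;
      int b = utf8[i] & 0xFF;
      if (b < 0x80) {
        st.chars[out++] = (char) b;
        i += 1;
      } else if (b < 0xE0) {
        st.chars[out++] = (char) (((b & 0x1F) << 6) | (utf8[i + 1] & 0x3F));
        i += 2;
      } else if (b < 0xF0) {
        st.chars[out++] = (char) (((b & 0x0F) << 12) | ((utf8[i + 1] & 0x3F) << 6) | (utf8[i + 2] & 0x3F));
        i += 3;
      } else {
        int cp = ((b & 0x07) << 18) | ((utf8[i + 1] & 0x3F) << 12)
            | ((utf8[i + 2] & 0x3F) << 6) | (utf8[i + 3] & 0x3F);
        out += Character.toChars(cp, st.chars, out); // writes a surrogate pair
        i += 4;
      }
    }
    st.charStartAtByte[len] = out;
    st.length = out;
  }
}
```

A caller would decode the first term with `changedFrom = 0` and pass the shared-prefix byte length for each following term.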

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>                 Key: LUCENE-510
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch,, strings.diff,
> We should change the format of strings written to indexes so that the length of the string
> is in bytes, not Java characters.  This issue has been discussed at:
> We must increment the file format number to indicate this change.  At least the format
> number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is
> released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated
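
The format change described in the issue can be illustrated with a small sketch: the string is written as a variable-length int holding the number of UTF-8 bytes, followed by those bytes, so a reader can skip or bulk-read a string without decoding it. The class and method names here are hypothetical, standard UTF-8 via `String.getBytes` is an assumption of the sketch, and a `ByteArrayOutputStream` stands in for `IndexOutput`.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ByteLengthStringSketch {

  // Variable-length int: 7 data bits per byte, high bit set on every byte except the last.
  static void writeVInt(ByteArrayOutputStream out, int i) {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }

  // Writes the length in bytes (not Java chars), then the UTF-8 bytes.
  static void writeString(ByteArrayOutputStream out, String s) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeVInt(out, utf8.length);
    out.write(utf8, 0, utf8.length);
  }

  // pos[0] is the read cursor; it is advanced past the bytes consumed.
  static int readVInt(byte[] in, int[] pos) {
    byte b = in[pos[0]++];
    int i = b & 0x7F;
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = in[pos[0]++];
      i |= (b & 0x7F) << shift;
    }
    return i;
  }

  static String readString(byte[] in, int[] pos) {
    int byteLen = readVInt(in, pos);
    String s = new String(in, pos[0], byteLen, StandardCharsets.UTF_8);
    pos[0] += byteLen; // with a byte length, skipping needs no decoding
    return s;
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    String s = "h\u00E9llo";
    writeString(out, s);
    byte[] bytes = out.toByteArray();
    System.out.println("chars=" + s.length() + " length prefix=" + bytes[0]);
    System.out.println(readString(bytes, new int[] {0}).equals(s));
  }
}
```

For `"h\u00E9llo"` the char count and the byte count differ, which is the case where a char-based length prefix cannot tell the reader how many bytes to consume.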

