lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Wed, 19 Mar 2008 22:40:24 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580592#action_12580592 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

{quote}
I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?
{quote}

I think there are two reasons for rolling our own instead of using
CharsetEncoder/Decoder.  The first is performance.  If I use
CharsetEncoder, like this:

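  // Reusable encoder and buffers, allocated once outside the timed loop
  CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
  CharBuffer cb = CharBuffer.allocate(5000);
  ByteBuffer bb = ByteBuffer.allocate(5000);
  byte[] bbArray = bb.array();
  // Only used by the UnicodeUtil variant of the loop body
  UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

  long t0 = System.currentTimeMillis();
  for(int i=0;i<count;i++) {
    // Copy the string into the char buffer and prepare it for reading
    cb.clear();
    cb.put(strings[i]);
    cb.flip();
    // Reset the output buffer and encoder state, then encode in one call
    bb.clear();
    encoder.reset();
    encoder.encode(cb, bb, true);
  }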

Then it takes 676 msec to convert ~3.3 million strings (the terms
from indexing the first 200K Wikipedia docs).  If I replace the body
of the for loop with:

  UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result);

It takes 441 msec instead, about 35% less time.
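
For readers without the patch at hand, here is a minimal, self-contained
sketch of what a hand-rolled UTF-16 to UTF-8 conversion into a reusable
buffer can look like.  It is not the code in UnicodeUtil: the class name
Utf8EncodeSketch, the Result holder, and the choice to replace unpaired
surrogates with U+FFFD are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8EncodeSketch {

    /** Hypothetical reusable output holder (loosely modeled on the idea of a result object). */
    static final class Result {
        byte[] bytes = new byte[16];
        int length;
    }

    /** Encodes s[offset, offset+len) as UTF-8 into out, reallocating out.bytes only when too small. */
    static void encode(char[] s, int offset, int len, Result out) {
        int max = len * 3; // worst case: 3 bytes per UTF-16 code unit
        if (out.bytes.length < max) {
            out.bytes = new byte[max];
        }
        byte[] b = out.bytes;
        int upto = 0;
        int end = offset + len;
        for (int i = offset; i < end; i++) {
            char c = s[i];
            if (c < 0x80) {
                b[upto++] = (byte) c;
            } else if (c < 0x800) {
                b[upto++] = (byte) (0xC0 | (c >> 6));
                b[upto++] = (byte) (0x80 | (c & 0x3F));
            } else if (Character.isHighSurrogate(c) && i + 1 < end
                    && Character.isLowSurrogate(s[i + 1])) {
                // Valid surrogate pair: combine into one code point, emit 4 bytes
                int cp = Character.toCodePoint(c, s[++i]);
                b[upto++] = (byte) (0xF0 | (cp >> 18));
                b[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                b[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                b[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else if (Character.isSurrogate(c)) {
                // Assumption: unpaired surrogates become U+FFFD (EF BF BD)
                b[upto++] = (byte) 0xEF;
                b[upto++] = (byte) 0xBF;
                b[upto++] = (byte) 0xBD;
            } else {
                b[upto++] = (byte) (0xE0 | (c >> 12));
                b[upto++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                b[upto++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        out.length = upto;
    }

    public static void main(String[] args) {
        char[] s = "h\u00e9llo".toCharArray();
        Result r = new Result();
        encode(s, 0, s.length, r);
        byte[] expected = new String(s).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(Arrays.copyOf(r.bytes, r.length), expected));
    }
}
```

The point of the Result holder is that the same byte[] is reused across
calls, so a loop over millions of terms allocates almost nothing.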

The second reason is an API mismatch.  E.g., we need to convert a
char[] that ends in the 0xffff character.  We also need to do
incremental conversion (only convert the changed bytes), which is used
by TermEnum.  CharsetEncoder/Decoder doesn't do this.
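
To make "only convert changed bytes" concrete, here is one way such an
incremental UTF-8 to UTF-16 conversion could be sketched.  This is an
illustration under assumptions, not the patch's code: the class
IncrementalUtf8DecodeSketch, its fields, and the update() contract
(input is valid UTF-8, and bytes before start are identical to the
previous call) are all hypothetical.

```java
import java.util.Arrays;

public class IncrementalUtf8DecodeSketch {
    char[] chars = new char[16];
    // offsets[i] = index in chars of the character starting at byte i,
    // or -1 if byte i is a continuation byte
    int[] offsets = new int[17];
    int charLength;

    /**
     * Decodes utf8[0, length), reusing the chars already decoded for bytes [0, start).
     * Assumes valid UTF-8, that the first call uses start == 0, and that start
     * does not exceed the length of the previous or the current input.
     */
    void update(byte[] utf8, int start, int length) {
        if (chars.length < length) {
            chars = Arrays.copyOf(chars, length);
        }
        if (offsets.length < length + 1) {
            offsets = Arrays.copyOf(offsets, length + 1);
        }
        // The shared byte prefix may end mid-character: back up to the lead byte
        while (start > 0 && offsets[start] == -1) {
            start--;
        }
        int out = offsets[start];
        int i = start;
        while (i < length) {
            int b = utf8[i] & 0xFF;
            offsets[i] = out;
            int cp;
            int n;
            if (b < 0x80) {
                cp = b;
                n = 1;
            } else if (b < 0xE0) {
                cp = b & 0x1F;
                n = 2;
            } else if (b < 0xF0) {
                cp = b & 0x0F;
                n = 3;
            } else {
                cp = b & 0x07;
                n = 4;
            }
            for (int k = 1; k < n; k++) {
                cp = (cp << 6) | (utf8[i + k] & 0x3F);
                offsets[i + k] = -1;
            }
            // Writes one char, or a surrogate pair for supplementary code points
            out += Character.toChars(cp, chars, out);
            i += n;
        }
        offsets[length] = out;
        charLength = out;
    }
}
```

With terms stored as a shared-prefix length plus a suffix, a caller
would pass the shared byte count as start, so only the suffix is
decoded again for each successive term.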



> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).
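
As a rough illustration of the format change described in the issue,
the sketch below writes a string as a variable-length int holding the
byte count, followed by the UTF-8 bytes.  It is a standalone
approximation, not IndexOutput itself: the class ByteLengthStringSketch
and the use of ByteArrayOutputStream are assumptions made for the
example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ByteLengthStringSketch {

    /** Writes i as a variable-length int: 7 bits per byte, low-order first, high bit = "more follows". */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    /** Writes the UTF-8 byte length (not the Java char count), then the bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, "\u00e9"); // one Java char, two UTF-8 bytes
        System.out.println(out.size());
    }
}
```

With a byte-length prefix, a reader can skip over a string without
decoding it, which a char-count prefix does not allow.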

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.