lucene-dev mailing list archives

From "Michael McCandless (JIRA)"
Subject [jira] Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes
Date Mon, 17 Mar 2008 20:04:24 GMT


Michael McCandless updated LUCENE-510:

    Attachment: LUCENE-510.take2.patch

New rev of the patch.  I think it's ready to commit.  I'll wait a few

I made some performance improvements by factoring out a new
UnicodeUtil class that does not allocate new objects for every
conversion to/from UTF8.

One new issue I fixed is the handling of invalid UTF-16 strings.
Specifically, if the UTF-16 text has invalid surrogate pairs, UTF-8 is
unable to represent it (unlike the current modified UTF-8 Lucene
format).  I changed DocumentsWriter & UnicodeUtil to substitute the
replacement char U+FFFD for such invalid surrogate characters.  This
affects terms, stored String fields and term vectors.
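
To illustrate the substitution described above, here is a minimal sketch of a UTF-16 to UTF-8 encoder that replaces unpaired surrogates with U+FFFD and reuses one output buffer across calls. The class name Utf16ToUtf8Sketch and its methods are hypothetical; this is not the UnicodeUtil code from the patch, only one way the idea could look.

```java
import java.util.Arrays;

// Hypothetical sketch; NOT the UnicodeUtil class from the patch.
final class Utf16ToUtf8Sketch {
  // Reusable output buffer, grown only when a longer input arrives.
  private byte[] buf = new byte[16];
  private int length;

  /** Encodes s as UTF-8 into the reusable buffer; unpaired surrogates become U+FFFD. */
  void encode(CharSequence s) {
    int max = s.length() * 3; // worst case: 3 bytes per UTF-16 code unit
    if (buf.length < max) buf = new byte[max];
    int upto = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      int code = c;
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          // Valid pair: combine into one supplementary code point.
          code = Character.toCodePoint(c, s.charAt(++i));
        } else {
          code = 0xFFFD; // high surrogate with no low surrogate after it
        }
      } else if (Character.isLowSurrogate(c)) {
        code = 0xFFFD; // low surrogate with no high surrogate before it
      }
      if (code < 0x80) {
        buf[upto++] = (byte) code;
      } else if (code < 0x800) {
        buf[upto++] = (byte) (0xC0 | (code >> 6));
        buf[upto++] = (byte) (0x80 | (code & 0x3F));
      } else if (code < 0x10000) {
        buf[upto++] = (byte) (0xE0 | (code >> 12));
        buf[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
        buf[upto++] = (byte) (0x80 | (code & 0x3F));
      } else {
        buf[upto++] = (byte) (0xF0 | (code >> 18));
        buf[upto++] = (byte) (0x80 | ((code >> 12) & 0x3F));
        buf[upto++] = (byte) (0x80 | ((code >> 6) & 0x3F));
        buf[upto++] = (byte) (0x80 | (code & 0x3F));
      }
    }
    length = upto;
  }

  /** The shared buffer; only the first length() bytes are valid. */
  byte[] buffer() { return buf; }

  int length() { return length; }

  /** Convenience copy for callers that need an exact-size array (this one allocates). */
  byte[] toByteArray() { return Arrays.copyOf(buf, length); }
}
```

A caller that wants to avoid per-conversion allocation would keep one instance around and read buffer() and length() after each encode() call, rather than calling toByteArray().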

Indexing performance has a small slowdown (3.5%); details are below.

Unfortunately, time to enumerate terms was more affected.  I made a
simple test that enumerates all terms from the index (= ~3.3 million
terms) created below:

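  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.TermEnum;
  public class TestTermEnum {
    public static void main(String[] args) throws Exception {
      // args[0] is the index directory (open call reconstructed; garbled in the archive)
      IndexReader r = IndexReader.open(args[0]);
      TermEnum terms = r.terms();
      int count = 0;
      long t0 = System.currentTimeMillis();
      // Enumeration loop reconstructed from context; missing in the archived copy
      while (terms.next())
        count++;
      long t1 = System.currentTimeMillis();
      System.out.println(count + " terms in " + (t1-t0) + " millis");
    }
  }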

On trunk with current index format this takes 3104 msec (best of 5).
With the patch with UTF8 index format it takes 3443 msec = 10.9%
slower.  I don't see any further ways to make this faster.

Details on the indexing performance test:

  doc.stored = true
  doc.term.vector = true
  { "Rounds"
    { "BuildIndex"
      { "AddDocs" AddDoc > : 200000
      - CloseIndex
    }
  } : 5
  RepSumByPrefRound BuildIndex

I ran it on a quad-core Intel Mac Pro, with 4 drive RAID 0 array,
running OS 10.4.11, java 1.5, run with these command-line args:

  -server -Xbatch -Xms1024m -Xmx1024m

Best of 5 with current trunk is 921.2 docs/sec and with patch it's
888.7 = 3.5% slowdown.
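
The two percentages quoted in this message use different bases: the term enumeration figure is extra elapsed time relative to trunk, while the indexing figure is lost throughput relative to trunk. A small sketch of that arithmetic, with hypothetical helper names, using only the numbers reported here:

```java
// Hypothetical helpers that reproduce the percentages quoted in the message.
final class SlowdownSketch {
  /** Percent extra elapsed time: (patched - baseline) / baseline. */
  static double slowerPercent(double baselineMillis, double patchedMillis) {
    return 100.0 * (patchedMillis - baselineMillis) / baselineMillis;
  }

  /** Percent lost throughput: (baseline - patched) / baseline. */
  static double throughputDropPercent(double baselineRate, double patchedRate) {
    return 100.0 * (baselineRate - patchedRate) / baselineRate;
  }
}
```

With the reported values, slowerPercent(3104, 3443) is about 10.9 and throughputDropPercent(921.2, 888.7) is about 3.5.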

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>                 Key: LUCENE-510
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch,, strings.diff,
> We should change the format of strings written to indexes so that the length of the
> string is in bytes, not Java characters.  This issue has been discussed at:
> We must increment the file format number to indicate this change.  At least the
> format number in the segments file should change.
> I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is
> released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of
> deprecated
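
To make the proposed change concrete, here is a minimal sketch of writing a string with its length in bytes rather than in Java characters. It assumes a variable-length integer prefix in the style of Lucene's VInt followed by standard UTF-8 bytes; the class and method names are hypothetical and this is not the code from the attached patches. It also assumes well-formed input, since String.getBytes does not substitute U+FFFD for unpaired surrogates.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; NOT the IndexOutput implementation from the patch.
final class ByteLengthStringSketch {
  /** Writes i as a variable-length int: 7 data bits per byte, high bit set if more bytes follow. */
  static void writeVInt(OutputStream out, int i) throws IOException {
    while ((i & ~0x7F) != 0) {
      out.write((i & 0x7F) | 0x80);
      i >>>= 7;
    }
    out.write(i);
  }

  /** Writes the UTF-8 byte count as the prefix, then the UTF-8 bytes themselves. */
  static void writeString(OutputStream out, String s) throws IOException {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeVInt(out, utf8.length); // length in bytes, not s.length() chars
    out.write(utf8);
  }

  /** Convenience wrapper that returns the encoded form as a byte array. */
  static byte[] encode(String s) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeString(out, s);
    return out.toByteArray();
  }
}
```

For a string such as "h\u00E9llo", s.length() is 5 but the prefix written is 6, because U+00E9 takes two bytes in UTF-8. A reader that knows the byte length can skip over the string without decoding it.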

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
