lucene-dev mailing list archives

From: Marvin Humphrey <mar...@rectangular.com>
Subject: bytecount as String and prefix length
Date: Sun, 30 Oct 2005 22:39:31 GMT
Greets,

I've been experimenting with using the UTF-8 bytecount as the VInt  
count at the top of Lucene's string format, as was discussed back in  
the "Lucene does NOT use UTF-8" thread.  Changes were made to  
IndexInput and IndexOutput as per some of Robert Engels'
suggestions.  Here's the implementation of writeString, which chooses  
the memory hit of buffering over the performance hit of pre-scanning:

   public void writeString(String s) throws IOException {
     utf8Bytes = s.getBytes("UTF-8");
     int length = utf8Bytes.length;
     writeVInt(length);
     writeBytes(utf8Bytes, length);
   }

That, in conjunction with a similar implementation of readString and  
the UTF-8-clean implementation of readChars I submitted a while back,  
executes 2-3% slower than the current implementation against my  
index-1000-wikipedia-articles benchmarker.
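
For contrast, a pre-scanning writeString would count the UTF-8 bytes up
front rather than allocate an intermediate byte[].  Here's a rough sketch --
the utf8Length helper is hypothetical, not part of the patch, and it glosses
over unpaired surrogates:

   // Hypothetical pre-scanning variant -- counts UTF-8 bytes without
   // building a byte[], at the cost of walking the string twice.
   private static int utf8Length(String s) {
     int bytes = 0;
     for (int i = 0; i < s.length(); i++) {
       char c = s.charAt(i);
       if (c < 0x80)
         bytes += 1;                              // ASCII
       else if (c < 0x800)
         bytes += 2;                              // two-byte sequence
       else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.length()
                && s.charAt(i + 1) >= 0xDC00 && s.charAt(i + 1) <= 0xDFFF) {
         bytes += 4;                              // surrogate pair
         i++;                                     // skip the low surrogate
       } else
         bytes += 3;                              // three-byte sequence
     }
     return bytes;
   }

Whether the double pass beats the garbage generated by getBytes("UTF-8")
is the kind of question the benchmarker would have to settle.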

I also had a hack at TermBuffer, TermInfosWriter, and StringHelper,  
in an attempt to convert the prefix length for the Term Dictionary  
from chars to UTF-8 bytes.  (Note that I've left off changing  
TermVectors for now.)

A major change has to be inflicted on TermBuffer in order for it to  
work with a bytecount: the char[] "text" buffer must be swapped out  
for a byte[] "utf8Bytes" buffer.  With that change in place,  
TermBuffer's 'read' method can utilize readBytes instead of readChars:

   public final void read(IndexInput input, FieldInfos fieldInfos)
     throws IOException {
     this.term = null;                           // invalidate cache
     int start = input.readVInt();
     int length = input.readVInt();
     int totalLength = start + length;
     setutf8Length(totalLength);
     input.readBytes(this.utf8Bytes, start, length);
     this.field = fieldInfos.fieldName(input.readVInt());
   }

I'd thought this might speed up scanning through a SegmentTermEnum,  
since the intermediate step of converting UTF-8 bytes to chars was  
deferred until toTerm gets called.  However, a ScanEnum benchmarker I cooked up, which just calls next() a whole bunch of times, indicated a wash.

Concerns were raised before about whether it would be necessary to  
convert all strings to UTF-8 to calculate prefix length for the Term  
Dictionary.  Yes, it's necessary to convert them -- though the same  
copy can be used as a byte buffer, which obviates the need to call writeChars.

   private final void writeTerm(Term term)
        throws IOException {
     byte[] bytes = term.text().getBytes("UTF-8");
     int totalLength = bytes.length;

     int start = StringHelper.bytesDifference(lastBytes, bytes);
     int length = totalLength - start;

     output.writeVInt(start);                   // write shared prefix length
     output.writeVInt(length);                  // write delta length
     for (int i = start ; i < totalLength; i++) {
       output.writeByte(bytes[i]);              // write delta UTF-8 bytes
     }

     output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num

     lastTerm = term;
     lastBytes = bytes;
   }
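
To make the change of units concrete, here's a small illustration (a
hypothetical snippet, not part of the patch) of how the shared prefix
differs once it's measured in UTF-8 bytes rather than chars; it uses the
bytesDifference method the patch adds to StringHelper:

   // For the adjacent terms "über" and "übrig", the shared prefix is
   // 2 chars but 3 bytes, since 'ü' encodes as two UTF-8 bytes (C3 BC).
   import org.apache.lucene.util.StringHelper;

   public class PrefixLengthDemo {
     public static void main(String[] args) throws Exception {
       byte[] a = "über".getBytes("UTF-8");      // C3 BC 62 65 72
       byte[] b = "übrig".getBytes("UTF-8");     // C3 BC 62 72 69 67
       System.out.println(StringHelper.stringDifference("über", "übrig")); // 2
       System.out.println(StringHelper.bytesDifference(a, b));             // 3
     }
   }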


Unfortunately, once the changes to TermBuffer, TermInfosWriter, and  
StringHelper are applied, execution speed at index-time suffers a  
slowdown of about 20%.  Perhaps this can be blamed on all the calls  
to getBytes("UTF-8") in TermInfosWriter?  Maybe alternative implementations using ByteBuffer, CharsetEncoder, and CharsetDecoder could mitigate the problem?
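
For what it's worth, such an alternative might look roughly like the
following -- a hypothetical, untested sketch that keeps one CharsetEncoder
and one heap ByteBuffer per IndexOutput instead of calling getBytes for
every string, with error handling omitted:

   // Hypothetical sketch, not part of the patch.
   // Requires: import java.nio.*; import java.nio.charset.*;
   private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
   private ByteBuffer byteBuf = ByteBuffer.allocate(256);

   public void writeString(String s) throws IOException {
     int maxBytes = (int) (encoder.maxBytesPerChar() * s.length()) + 1;
     if (byteBuf.capacity() < maxBytes)
       byteBuf = ByteBuffer.allocate(maxBytes);   // grow the reusable buffer
     byteBuf.clear();
     encoder.reset();
     encoder.encode(CharBuffer.wrap(s), byteBuf, true);  // CoderResult ignored
     encoder.flush(byteBuf);
     int length = byteBuf.position();
     writeVInt(length);
     writeBytes(byteBuf.array(), length);         // heap buffer, so array() is safe
   }

Whether the saved allocations outweigh the extra copy inside encode() is
exactly the sort of thing the indexing benchmarker would have to answer.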

A patch for all 5 files against repository revision 329640 is below;  
to apply it, cd to the trunk directory first.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermBuffer.java src/java/org/apache/lucene/index/TermBuffer.java
--- src_old/java/org/apache/lucene/index/TermBuffer.java     2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermBuffer.java    2005-10-30 13:27:47.000000000 -0800
@@ -20,40 +20,38 @@
import org.apache.lucene.store.IndexInput;
final class TermBuffer implements Cloneable {
-  private static final char[] NO_CHARS = new char[0];
+  private static final byte[] NO_BYTES = new byte[0];
    private String field;
-  private char[] text = NO_CHARS;
-  private int textLength;
+  private byte[] utf8Bytes = NO_BYTES;
+  private int utf8Length;
    private Term term;                            // cached
    public final int compareTo(TermBuffer other) {
      if (field == other.field)              // fields are interned
-      return compareChars(text, textLength, other.text, other.textLength);
+      return compareUTF8(utf8Bytes, utf8Length,
+        other.utf8Bytes, other.utf8Length);
      else
        return field.compareTo(other.field);
    }
-  private static final int compareChars(char[] v1, int len1,
-                                        char[] v2, int len2) {
+  private static final int compareUTF8(byte[] v1, int len1,
+                                       byte[] v2, int len2) {
      int end = Math.min(len1, len2);
      for (int k = 0; k < end; k++) {
-      char c1 = v1[k];
-      char c2 = v2[k];
-      if (c1 != c2) {
-        return c1 - c2;
-      }
+      if (v1[k] != v2[k])
+        return v1[k] - v2[k];
      }
      return len1 - len2;
    }
-  private final void setTextLength(int newLength) {
-    if (text.length < newLength) {
-      char[] newText = new char[newLength];
-      System.arraycopy(text, 0, newText, 0, textLength);
-      text = newText;
+  private final void setutf8Length(int newLength) {
+    if (utf8Bytes.length < newLength) {
+      byte[] newBytes = new byte[newLength];
+      System.arraycopy(utf8Bytes, 0, newBytes, 0, utf8Length);
+      utf8Bytes = newBytes;
      }
-    textLength = newLength;
+    utf8Length = newLength;
    }
    public final void read(IndexInput input, FieldInfos fieldInfos)
@@ -62,28 +60,30 @@
      int start = input.readVInt();
      int length = input.readVInt();
      int totalLength = start + length;
-    setTextLength(totalLength);
-    input.readChars(this.text, start, length);
+    setutf8Length(totalLength);
+    input.readBytes(this.utf8Bytes, start, length);
      this.field = fieldInfos.fieldName(input.readVInt());
    }
-  public final void set(Term term) {
-    if (term == null) {
+  public final void set(Term t)  {
+    if (t == null) {
        reset();
        return;
      }
-    // copy text into the buffer
-    setTextLength(term.text().length());
-    term.text().getChars(0, term.text().length(), text, 0);
+    // convert chars into UTF-8 bytes, store in buffer
+    try {
+        utf8Bytes = t.text().getBytes("UTF-8");
+    } catch (java.io.UnsupportedEncodingException e) { }
+    setutf8Length(utf8Bytes.length);
-    this.field = term.field();
-    this.term = term;
+    this.field = t.field();
+    this.term = t;
    }
    public final void set(TermBuffer other) {
-    setTextLength(other.textLength);
-    System.arraycopy(other.text, 0, text, 0, textLength);
+    setutf8Length(other.utf8Length);
+    System.arraycopy(other.utf8Bytes, 0, utf8Bytes, 0, utf8Length);
      this.field = other.field;
      this.term = other.term;
@@ -91,7 +91,7 @@
    public void reset() {
      this.field = null;
-    this.textLength = 0;
+    this.utf8Length = 0;
      this.term = null;
    }
@@ -100,7 +100,9 @@
        return null;
      if (term == null)
-      term = new Term(field, new String(text, 0, textLength), false);
+    try {
+      term = new Term(field, new String(utf8Bytes, 0, utf8Length, "UTF-8"), false);
+    } catch (java.io.UnsupportedEncodingException e) { }
      return term;
    }
@@ -111,8 +113,8 @@
        clone = (TermBuffer)super.clone();
      } catch (CloneNotSupportedException e) {}
-    clone.text = new char[text.length];
-    System.arraycopy(text, 0, clone.text, 0, textLength);
+    clone.utf8Bytes = new byte[utf8Bytes.length];
+    System.arraycopy(utf8Bytes, 0, clone.utf8Bytes, 0, utf8Length);
      return clone;
    }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/index/TermInfosWriter.java src/java/org/apache/lucene/index/TermInfosWriter.java
--- src_old/java/org/apache/lucene/index/TermInfosWriter.java     2005-10-30 14:30:23.000000000 -0800
+++ src/java/org/apache/lucene/index/TermInfosWriter.java     2005-10-30 13:47:43.000000000 -0800
@@ -33,6 +33,8 @@
    private IndexOutput output;
    private Term lastTerm = new Term("", "");
    private TermInfo lastTi = new TermInfo();
+  private byte[] NO_BYTES = new byte[0];
+  private byte[] lastBytes = NO_BYTES;
    private long size = 0;
    // TODO: the default values for these two parameters should be settable from
@@ -91,8 +93,10 @@
     TermInfo pointers must be positive and greater than all previous.*/
    final void add(Term term, TermInfo ti)
         throws IOException {
-    if (!isIndex && term.compareTo(lastTerm) <= 0)
-      throw new IOException("term out of order");
+    if (!isIndex && term.compareTo(lastTerm) <= 0) {
+      throw new IOException("term out of order" + lastTerm.field +
+      "--" + lastTerm.text + "--" + term.field + "--" + term.text);
+    }
      if (ti.freqPointer < lastTi.freqPointer)
        throw new IOException("freqPointer out of order");
      if (ti.proxPointer < lastTi.proxPointer)
@@ -121,16 +125,22 @@
    private final void writeTerm(Term term)
         throws IOException {
-    int start = StringHelper.stringDifference(lastTerm.text, term.text);
-    int length = term.text.length() - start;
+    byte[] bytes = term.text().getBytes("UTF-8");
+    int totalLength = bytes.length;
+
+    int start = StringHelper.bytesDifference(lastBytes, bytes);
+    int length = totalLength - start;
      output.writeVInt(start);                   // write shared prefix length
      output.writeVInt(length);                  // write delta length
-    output.writeChars(term.text, start, length);  // write delta chars
+    for (int i = start ; i < totalLength; i++) {
+      output.writeByte(bytes[i]);              // write delta UTF-8 bytes
+    }
      output.writeVInt(fieldInfos.fieldNumber(term.field)); // write field num
      lastTerm = term;
+    lastBytes = bytes;
    }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexInput.java src/java/org/apache/lucene/store/IndexInput.java
--- src_old/java/org/apache/lucene/store/IndexInput.java     2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexInput.java    2005-10-30 13:27:47.000000000 -0800
@@ -23,7 +23,7 @@
   * @see Directory
   */
public abstract class IndexInput implements Cloneable {
-  private char[] chars;                           // used by readString()
+  private byte[] utf8Bytes;                           // used by readString()
    /** Reads and returns a single byte.
     * @see IndexOutput#writeByte(byte)
@@ -87,32 +87,66 @@
     */
    public String readString() throws IOException {
      int length = readVInt();
-    if (chars == null || length > chars.length)
-      chars = new char[length];
-    readChars(chars, 0, length);
-    return new String(chars, 0, length);
-  }
-
-  /** Reads UTF-8 encoded characters into an array.
+    if (utf8Bytes == null || length > utf8Bytes.length)
+      utf8Bytes = new byte[length];
+    readBytes(utf8Bytes, 0, length);
+    return new String(utf8Bytes, 0, length, "UTF-8");
+  }
+
+  static final byte[] TRAILING_BYTES_FOR_UTF8 = {
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
+    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
+    3,3,3,3,3,3,3,3
+  };
+
+ /** Reads UTF-8 encoded characters into an array.
     * @param buffer the array to read characters into
     * @param start the offset in the array to start storing characters
     * @param length the number of characters to read
     * @see IndexOutput#writeChars(String,int,int)
     */
-  public void readChars(char[] buffer, int start, int length)
+  public void readChars(char[] buffer, int start, int length)
         throws IOException {
-    final int end = start + length;
+    final int end = start + length;
      for (int i = start; i < end; i++) {
-      byte b = readByte();
-      if ((b & 0x80) == 0)
-    buffer[i] = (char)(b & 0x7F);
-      else if ((b & 0xE0) != 0xE0) {
-    buffer[i] = (char)(((b & 0x1F) << 6)
-         | (readByte() & 0x3F));
-      } else
-    buffer[i] = (char)(((b & 0x0F) << 12)
-        | ((readByte() & 0x3F) << 6)
-            |  (readByte() & 0x3F));
+      byte b = readByte();
+      switch (TRAILING_BYTES_FOR_UTF8[b & 0xFF]) {
+        case 0:
+          buffer[i] = (char)(b & 0x7F);
+          break;
+        case 1:
+          buffer[i] = (char)(((b & 0x1F) << 6)
+            | (readByte() & 0x3F));
+          break;
+        case 2:
+          buffer[i] = (char)(((b & 0x0F) << 12)
+            | ((readByte() & 0x3F) << 6)
+            |  (readByte() & 0x3F));
+          break;
+        case 3:
+          int utf32 = (((b & 0x0F) << 18)
+            | ((readByte() & 0x3F) << 12)
+            | ((readByte() & 0x3F) << 6)
+            |  (readByte() & 0x3F));
+          buffer[i] = (char)((utf32 >> 10) + 0xD7C0);
+          i++;
+          buffer[i] = (char)((utf32 & 0x03FF) + 0xDC00);
+          break;
+      }
      }
    }
@@ -148,7 +182,7 @@
        clone = (IndexInput)super.clone();
      } catch (CloneNotSupportedException e) {}
-    clone.chars = null;
+    clone.utf8Bytes = null;
      return clone;
    }
diff -u -r -X../excludefile src_old/java/org/apache/lucene/store/IndexOutput.java src/java/org/apache/lucene/store/IndexOutput.java
--- src_old/java/org/apache/lucene/store/IndexOutput.java     2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/store/IndexOutput.java    2005-10-30 13:27:47.000000000 -0800
@@ -25,6 +25,8 @@
   */
public abstract class IndexOutput {
+  private byte[] utf8Bytes;                        // used by writeString()
+
    /** Writes a single byte.
     * @see IndexInput#readByte()
     */
@@ -85,9 +87,11 @@
     * @see IndexInput#readString()
     */
    public void writeString(String s) throws IOException {
-    int length = s.length();
+    utf8Bytes = s.getBytes("UTF-8");
+
+    int length = utf8Bytes.length;
      writeVInt(length);
-    writeChars(s, 0, length);
+    writeBytes(utf8Bytes, length);
    }
    /** Writes a sequence of UTF-8 encoded characters from a string.
diff -u -r -X../excludefile src_old/java/org/apache/lucene/util/StringHelper.java src/java/org/apache/lucene/util/StringHelper.java
--- src_old/java/org/apache/lucene/util/StringHelper.java     2005-10-30 14:30:25.000000000 -0800
+++ src/java/org/apache/lucene/util/StringHelper.java    2005-10-30 13:48:07.000000000 -0800
@@ -44,7 +44,26 @@
      return len;
    }
-
+  /**
+   * Compares two byte[] arrays, element by element, and returns the
+   * number of elements common to both arrays.
+   *
+   * @param bytes1 The first byte[] to compare
+   * @param bytes2 The second byte[] to compare
+   * @return The number of common elements.
+   */
+  public static final int bytesDifference(byte[] bytes1, byte[] bytes2) {
+    int len1 = bytes1.length;
+    int len2 = bytes2.length;
+    int len = len1 < len2 ? len1 : len2;
+    for (int i = 0; i < len; i++) {
+      if (bytes1[i] != bytes2[i]) {
+          return i;
+      }
+    }
+    return len;
+  }
+
    private StringHelper() {
    }
}



