lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Ganyo <scott.ga...@etapestry.com>
Subject Re: Optimizing SegmentTermEnum (and friends)
Date Tue, 25 Feb 2003 22:23:02 GMT
Nice optimization!  If I read this right, I believe that this should 
have a beneficial impact on RangeQuery as well as PrefixQuery.  Have you 
had a chance to try it?

Thanks,
Scott

Dmitry Serebrennikov wrote:

> Greetings,
>
> I've been running Lucene (inside of our application) in OptimizeIt to 
> see if I can improve garbage generation and performance metrics of our 
> app. Let me say right a way that Lucene is great! :) The more 
> experience I have with it the more I find how well it performs. 
> However, perhaps there are a few places still left in Lucene that 
> could use further optimization. Particularly, I've been looking at 
> SegmentTermEnum, since our app does a lot of what I call "term 
> queries" - queries that result in finding matching terms rather than 
> matching documents.
>
> It seems that there are at least two other threads right now on the 
> list that are discussing similar types of queries, but I'm not too 
> familiar with those particular APIs. What we do is instantiate 
> TermEnums and than iterate through them in various ways. For example, 
> the following code fragment is what I used for testing. It is similar 
> to some of the queries that we run. I also think that any type of 
> "starts-with" wildcard query would also experience the same issues.
>
>    private static void run2(IndexReader ir) throws Exception {
>        for (int i=0; i<10; i++) {
>            for (char c1='a'; c1 < 'z'; c1++) {
>                for (char c2='a'; c2 < 'z'; c2++) {
>                   Term starter = new Term(FIELD, "" + c1 + c2 );
>                   TermEnum terms = ir.terms(starter);
>                   if (terms.isValid()) {
>                       Term target = terms.term();
>                       System.out.println(target);
>                   }
>                }
>            }
>        }
>    }
>
> The main source of garbage when running this code appears to be in 
> SegmentTermEnum, which insists on creating Term instances (with the 
> Strings and char[]s that go into them) for each term that it looks at. 
> The typical pattern is that it picks a place in the term enum using 
> the term index that is near the term I am requesting, seek()s to it, 
> and then proceeds to retrieve terms sequentially until it gets to the 
> term that I need.
>
> Now, to the question / proposal part:
> 1) Since I do not need the intermediate terms, it makes sence to try 
> to have a method that skips to the right term without creating the 
> intermediate Term objects. I have done a version of this yesterday and 
> ended up seeing a factor of 2 performance encrease and a factor of 2 
> garbage reduction. The patch adds the following method to Term.java:
> final int compareTo(String otherField, char[] otherText, int start, 
> int len)
> And changes SegmentTermEnum.java to delay creation of Term object 
> until call to term().
> Full diff is attached. Any comments are welcome, especially if I've 
> missed something.
>
> 2) So far, so good, but perhaps we can do better? I am looking for 
> other suggestions to implement around this functionality. For example, 
> the above change helps with "starts-with" queries, but will do nothing 
> for "contains" types of queries. Perhaps a better way to do this is to 
> provide a way for users to supply a TermPredicate that would be 
> evaluated against the term data without having to construct a Term 
> object for that. The skipTo(TermPredicate) would then stop at the next 
> matching term or the end of the enum.
>
> 3) I found a piece of code in TermInfosReader.java that uses a field 
> SegmentTermEnum.prev to try to optimize seeks. It looks like this code 
> was put in after the original SegmentTermEnum was finished. I can't 
> find any record of this change in Jakarta's CVS, so probably it was 
> done prior to moving to Jakarta. Does anyone remember why this is 
> here? Does it actually serve a useful purpose? It seems that the 
> condition this code is testing for would not really occur. Perhaps I'm 
> missing something. Here's the code fragment that uses the .prev field:
>
>  /** Returns the TermInfo for a Term in the set, or null. */
>  final synchronized TermInfo get(Term term) throws IOException {
>    if (size == 0) return null;
>      // optimize sequential access: first try scanning cached enum w/o 
> seeking
>    if (enum.term() != null              // term is at or past current
>        && ((enum.prev != null && term.compareTo(enum.prev) > 0)
>            || term.compareTo(enum.term()) >= 0)) {
>        int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1;
>        if (indexTerms.length == enumOffset      // but before end of 
> block
>            || term.compareTo(indexTerms[enumOffset]) < 0)
>                return scanEnum(term);              // no need to seek
>    }
>      // random-access: must seek
>    seekEnum(getIndexOffset(term));
>    return scanEnum(term);
>  }
>
>
>------------------------------------------------------------------------
>
>Index: Term.java
>===================================================================
>RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/Term.java,v
>retrieving revision 1.4
>diff -B -u -w -r1.4 Term.java
>--- Term.java	13 Jan 2003 23:50:33 -0000	1.4
>+++ Term.java	25 Feb 2003 19:16:25 -0000
>@@ -110,6 +110,22 @@
>       return field.compareTo(other.field);
>   }
> 
>+  /** Compares this term to a term represented as a field (interned) and 
>+   *  a fragment of a character array containing the term's text.
>+   */
>+  final int compareTo(String otherField, char[] otherText, int start, int len) {
>+    if (field == otherField) {
>+        int end = Math.min(text.length(), len);
>+        for (int i=0; i < end; i++) {
>+            int res = text.charAt(i) - otherText[start + i];
>+            if (res != 0) return res;
>+        }
>+        return text.length() - len;
>+    } else {
>+        return field.compareTo(otherField);
>+    }
>+  }
>+  
>   /** Resets the field and text of a Term. */
>   final void set(String fld, String txt) {
>     field = fld;
>Index: SegmentTermEnum.java
>===================================================================
>RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentTermEnum.java,v
>retrieving revision 1.2
>diff -B -u -w -r1.2 SegmentTermEnum.java
>--- SegmentTermEnum.java	11 Oct 2001 15:14:14 -0000	1.2
>+++ SegmentTermEnum.java	25 Feb 2003 19:19:28 -0000
>@@ -63,7 +63,6 @@
>   int size;
>   int position = -1;
> 
>-  private Term term = new Term("", "");
>   private TermInfo termInfo = new TermInfo();
> 
>   boolean isIndex = false;
>@@ -71,6 +70,9 @@
>   Term prev;
> 
>   private char[] buffer = {};
>+  private String currentTermField;
>+  private int currentTermLength = -1;
>+  private Term cachedTerm; // only created when asked for
> 
>   SegmentTermEnum(InputStream i, FieldInfos fis, boolean isi)
>        throws IOException {
>@@ -88,7 +90,10 @@
> 
>     clone.input = (InputStream)input.clone();
>     clone.termInfo = new TermInfo(termInfo);
>-    if (term != null) clone.growBuffer(term.text.length());
>+    clone.currentTermField = currentTermField;
>+    clone.currentTermLength = currentTermLength;
>+    if (buffer != null)
>+        clone.buffer = (char[]) buffer.clone();
> 
>     return clone;
>   }
>@@ -97,21 +102,56 @@
>        throws IOException {
>     input.seek(pointer);
>     position = p;
>-    term = t;
>+    
>     prev = null;
>     termInfo.set(ti);
>-    growBuffer(term.text.length());		  // copy term text into buffer
>+    currentTermField = t.field;
>+    currentTermLength = t.text.length();
>+    cachedTerm = null;
>+    buffer = t.text.toCharArray();
>   }
> 
>+  
>+  /** Skips terms to the first beyond the current whose value is
>+   * greater or equal to <i>target</i>. <p>Returns true iff there is
such
>+   * an entry.  <p>Behaves as if written: <pre>
>+   *   public boolean skipTo(Term target) {
>+   *     do {
>+   *       if (!next())
>+   * 	     return false;
>+   *     } while (target > term());
>+   *     return true;
>+   *   }
>+   * </pre>
>+   * Some implementations are considerably more efficient than that.
>+   */
>+  public final boolean skipTo(Term target) throws IOException {
>+    prev = null;
>+    
>+    do {
>+        if (! next()) return false;
>+    } while (target.compareTo(currentTermField, buffer, 0, currentTermLength) > 0);
>+    
>+    return true;
>+  }
>+
>+  
>   /** Increments the enumeration to the next element.  True if one exists.*/
>   public final boolean next() throws IOException {
>+    cachedTerm = null;
>+  
>     if (position++ >= size-1) {
>-      term = null;
>+      currentTermLength = -1;
>       return false;
>     }
> 
>-    prev = term;
>-    term = readTerm();
>+    int start = input.readVInt();
>+    int length = input.readVInt();
>+    currentTermLength = start + length;
>+    growBuffer(currentTermLength);
>+
>+    input.readChars(buffer, start, length);
>+    currentTermField = fieldInfos.fieldName(input.readVInt());
> 
>     termInfo.docFreq = input.readVInt();	  // read doc freq
>     termInfo.freqPointer += input.readVLong();	  // read freq pointer
>@@ -123,28 +163,26 @@
>     return true;
>   }
> 
>-  private final Term readTerm() throws IOException {
>-    int start = input.readVInt();
>-    int length = input.readVInt();
>-    int totalLength = start + length;
>-    if (buffer.length < totalLength)
>-      growBuffer(totalLength);
>-    
>-    input.readChars(buffer, start, length);
>-    return new Term(fieldInfos.fieldName(input.readVInt()),
>-		    new String(buffer, 0, totalLength), false);
>-  }
> 
>   private final void growBuffer(int length) {
>-    buffer = new char[length];
>-    for (int i = 0; i < term.text.length(); i++)  // copy contents
>-      buffer[i] = term.text.charAt(i);
>+    if (buffer == null || buffer.length < length) {
>+        char tmp[] = new char[length];
>+        System.arraycopy(buffer, 0, tmp, 0, buffer.length);
>+        buffer = tmp;
>+    }
>   }
> 
>   /** Returns the current Term in the enumeration.
>     Initially invalid, valid after next() called for the first time.*/
>   public final Term term() {
>-    return term;
>+    if (currentTermLength >= 0) {
>+        if (cachedTerm == null) {
>+            cachedTerm = new Term(currentTermField, new String(buffer, 0, currentTermLength),
false);
>+        }
>+        return cachedTerm;
>+    } else {
>+        return null;
>+    }
>   }
> 
>   /** Returns the current TermInfo in the enumeration.
>Index: TermInfosReader.java
>===================================================================
>RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
>retrieving revision 1.2
>diff -B -w -u -r1.2 TermInfosReader.java
>--- TermInfosReader.java	29 Jan 2003 17:18:54 -0000	1.2
>+++ TermInfosReader.java	25 Feb 2003 19:21:11 -0000
>@@ -162,7 +162,7 @@
>   
>   /** Scans within block for matching term. */
>   private final TermInfo scanEnum(Term term) throws IOException {
>-    while (term.compareTo(enum.term()) > 0 && enum.next()) {}
>+    if (term.compareTo(enum.term()) > 0) enum.skipTo(term);
>     if (enum.term() != null && term.compareTo(enum.term()) == 0)
>       return enum.termInfo();
>     else
>
>  
>
>------------------------------------------------------------------------
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>

-- 
"Perfection in design is not achieved when there is nothing more to add, but when there is
nothing more to remove." - Antoine de Saint-Exupéry



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message