Nice optimization! If I read this right, I believe that this should
have a beneficial impact on RangeQuery as well as PrefixQuery. Have you
had a chance to try it?
Thanks,
Scott
Dmitry Serebrennikov wrote:
> Greetings,
>
> I've been running Lucene (inside of our application) in OptimizeIt to
> see if I can improve garbage generation and performance metrics of our
> app. Let me say right a way that Lucene is great! :) The more
> experience I have with it the more I find how well it performs.
> However, perhaps there are a few places still left in Lucene that
> could use further optimization. Particularly, I've been looking at
> SegmentTermEnum, since our app does a lot of what I call "term
> queries" - queries that result in finding matching terms rather than
> matching documents.
>
> It seems that there are at least two other threads right now on the
> list that are discussing similar types of queries, but I'm not too
> familiar with those particular APIs. What we do is instantiate
> TermEnums and than iterate through them in various ways. For example,
> the following code fragment is what I used for testing. It is similar
> to some of the queries that we run. I also think that any type of
> "starts-with" wildcard query would also experience the same issues.
>
> private static void run2(IndexReader ir) throws Exception {
> for (int i=0; i<10; i++) {
> for (char c1='a'; c1 < 'z'; c1++) {
> for (char c2='a'; c2 < 'z'; c2++) {
> Term starter = new Term(FIELD, "" + c1 + c2 );
> TermEnum terms = ir.terms(starter);
> if (terms.isValid()) {
> Term target = terms.term();
> System.out.println(target);
> }
> }
> }
> }
> }
>
> The main source of garbage when running this code appears to be in
> SegmentTermEnum, which insists on creating Term instances (with the
> Strings and char[]s that go into them) for each term that it looks at.
> The typical pattern is that it picks a place in the term enum using
> the term index that is near the term I am requesting, seek()s to it,
> and then proceeds to retrieve terms sequentially until it gets to the
> term that I need.
>
> Now, to the question / proposal part:
> 1) Since I do not need the intermediate terms, it makes sence to try
> to have a method that skips to the right term without creating the
> intermediate Term objects. I have done a version of this yesterday and
> ended up seeing a factor of 2 performance encrease and a factor of 2
> garbage reduction. The patch adds the following method to Term.java:
> final int compareTo(String otherField, char[] otherText, int start,
> int len)
> And changes SegmentTermEnum.java to delay creation of Term object
> until call to term().
> Full diff is attached. Any comments are welcome, especially if I've
> missed something.
>
> 2) So far, so good, but perhaps we can do better? I am looking for
> other suggestions to implement around this functionality. For example,
> the above change helps with "starts-with" queries, but will do nothing
> for "contains" types of queries. Perhaps a better way to do this is to
> provide a way for users to supply a TermPredicate that would be
> evaluated against the term data without having to construct a Term
> object for that. The skipTo(TermPredicate) would then stop at the next
> matching term or the end of the enum.
>
> 3) I found a piece of code in TermInfosReader.java that uses a field
> SegmentTermEnum.prev to try to optimize seeks. It looks like this code
> was put in after the original SegmentTermEnum was finished. I can't
> find any record of this change in Jakarta's CVS, so probably it was
> done prior to moving to Jakarta. Does anyone remember why this is
> here? Does it actually serve a useful purpose? It seems that the
> condition this code is testing for would not really occur. Perhaps I'm
> missing something. Here's the code fragment that uses the .prev field:
>
> /** Returns the TermInfo for a Term in the set, or null. */
> final synchronized TermInfo get(Term term) throws IOException {
> if (size == 0) return null;
> // optimize sequential access: first try scanning cached enum w/o
> seeking
> if (enum.term() != null // term is at or past current
> && ((enum.prev != null && term.compareTo(enum.prev) > 0)
> || term.compareTo(enum.term()) >= 0)) {
> int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1;
> if (indexTerms.length == enumOffset // but before end of
> block
> || term.compareTo(indexTerms[enumOffset]) < 0)
> return scanEnum(term); // no need to seek
> }
> // random-access: must seek
> seekEnum(getIndexOffset(term));
> return scanEnum(term);
> }
>
>
>------------------------------------------------------------------------
>
>Index: Term.java
>===================================================================
>RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/Term.java,v
>retrieving revision 1.4
>diff -B -u -w -r1.4 Term.java
>--- Term.java 13 Jan 2003 23:50:33 -0000 1.4
>+++ Term.java 25 Feb 2003 19:16:25 -0000
>@@ -110,6 +110,22 @@
> return field.compareTo(other.field);
> }
>
>+ /** Compares this term to a term represented as a field (interned) and
>+ * a fragment of a character array containing the term's text.
>+ */
>+ final int compareTo(String otherField, char[] otherText, int start, int len) {
>+ if (field == otherField) {
>+ int end = Math.min(text.length(), len);
>+ for (int i=0; i < end; i++) {
>+ int res = text.charAt(i) - otherText[start + i];
>+ if (res != 0) return res;
>+ }
>+ return text.length() - len;
>+ } else {
>+ return field.compareTo(otherField);
>+ }
>+ }
>+
> /** Resets the field and text of a Term. */
> final void set(String fld, String txt) {
> field = fld;
>Index: SegmentTermEnum.java
>===================================================================
>RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentTermEnum.java,v
>retrieving revision 1.2
>diff -B -u -w -r1.2 SegmentTermEnum.java
>--- SegmentTermEnum.java 11 Oct 2001 15:14:14 -0000 1.2
>+++ SegmentTermEnum.java 25 Feb 2003 19:19:28 -0000
>@@ -63,7 +63,6 @@
> int size;
> int position = -1;
>
>- private Term term = new Term("", "");
> private TermInfo termInfo = new TermInfo();
>
> boolean isIndex = false;
>@@ -71,6 +70,9 @@
> Term prev;
>
> private char[] buffer = {};
>+ private String currentTermField;
>+ private int currentTermLength = -1;
>+ private Term cachedTerm; // only created when asked for
>
> SegmentTermEnum(InputStream i, FieldInfos fis, boolean isi)
> throws IOException {
>@@ -88,7 +90,10 @@
>
> clone.input = (InputStream)input.clone();
> clone.termInfo = new TermInfo(termInfo);
>- if (term != null) clone.growBuffer(term.text.length());
>+ clone.currentTermField = currentTermField;
>+ clone.currentTermLength = currentTermLength;
>+ if (buffer != null)
>+ clone.buffer = (char[]) buffer.clone();
>
> return clone;
> }
>@@ -97,21 +102,56 @@
> throws IOException {
> input.seek(pointer);
> position = p;
>- term = t;
>+
> prev = null;
> termInfo.set(ti);
>- growBuffer(term.text.length()); // copy term text into buffer
>+ currentTermField = t.field;
>+ currentTermLength = t.text.length();
>+ cachedTerm = null;
>+ buffer = t.text.toCharArray();
> }
>
>+
>+ /** Skips terms to the first beyond the current whose value is
>+ * greater or equal to <i>target</i>. <p>Returns true iff there is
such
>+ * an entry. <p>Behaves as if written: <pre>
>+ * public boolean skipTo(Term target) {
>+ * do {
>+ * if (!next())
>+ * return false;
>+ * } while (target > term());
>+ * return true;
>+ * }
>+ * </pre>
>+ * Some implementations are considerably more efficient than that.
>+ */
>+ public final boolean skipTo(Term target) throws IOException {
>+ prev = null;
>+
>+ do {
>+ if (! next()) return false;
>+ } while (target.compareTo(currentTermField, buffer, 0, currentTermLength) > 0);
>+
>+ return true;
>+ }
>+
>+
> /** Increments the enumeration to the next element. True if one exists.*/
> public final boolean next() throws IOException {
>+ cachedTerm = null;
>+
> if (position++ >= size-1) {
>- term = null;
>+ currentTermLength = -1;
> return false;
> }
>
>- prev = term;
>- term = readTerm();
>+ int start = input.readVInt();
>+ int length = input.readVInt();
>+ currentTermLength = start + length;
>+ growBuffer(currentTermLength);
>+
>+ input.readChars(buffer, start, length);
>+ currentTermField = fieldInfos.fieldName(input.readVInt());
>
> termInfo.docFreq = input.readVInt(); // read doc freq
> termInfo.freqPointer += input.readVLong(); // read freq pointer
>@@ -123,28 +163,26 @@
> return true;
> }
>
>- private final Term readTerm() throws IOException {
>- int start = input.readVInt();
>- int length = input.readVInt();
>- int totalLength = start + length;
>- if (buffer.length < totalLength)
>- growBuffer(totalLength);
>-
>- input.readChars(buffer, start, length);
>- return new Term(fieldInfos.fieldName(input.readVInt()),
>- new String(buffer, 0, totalLength), false);
>- }
>
> private final void growBuffer(int length) {
>- buffer = new char[length];
>- for (int i = 0; i < term.text.length(); i++) // copy contents
>- buffer[i] = term.text.charAt(i);
>+ if (buffer == null || buffer.length < length) {
>+ char tmp[] = new char[length];
>+ System.arraycopy(buffer, 0, tmp, 0, buffer.length);
>+ buffer = tmp;
>+ }
> }
>
> /** Returns the current Term in the enumeration.
> Initially invalid, valid after next() called for the first time.*/
> public final Term term() {
>- return term;
>+ if (currentTermLength >= 0) {
>+ if (cachedTerm == null) {
>+ cachedTerm = new Term(currentTermField, new String(buffer, 0, currentTermLength),
false);
>+ }
>+ return cachedTerm;
>+ } else {
>+ return null;
>+ }
> }
>
> /** Returns the current TermInfo in the enumeration.
>Index: TermInfosReader.java
>===================================================================
>RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/TermInfosReader.java,v
>retrieving revision 1.2
>diff -B -w -u -r1.2 TermInfosReader.java
>--- TermInfosReader.java 29 Jan 2003 17:18:54 -0000 1.2
>+++ TermInfosReader.java 25 Feb 2003 19:21:11 -0000
>@@ -162,7 +162,7 @@
>
> /** Scans within block for matching term. */
> private final TermInfo scanEnum(Term term) throws IOException {
>- while (term.compareTo(enum.term()) > 0 && enum.next()) {}
>+ if (term.compareTo(enum.term()) > 0) enum.skipTo(term);
> if (enum.term() != null && term.compareTo(enum.term()) == 0)
> return enum.termInfo();
> else
>
>
>
>------------------------------------------------------------------------
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
--
"Perfection in design is not achieved when there is nothing more to add, but when there is
nothing more to remove." - Antoine de Saint-Exupéry
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
|