lucene-dev mailing list archives

From Dmitry Serebrennikov <dmit...@earthlink.net>
Subject Optimizing SegmentTermEnum (and friends)
Date Tue, 25 Feb 2003 19:39:49 GMT
Greetings,

I've been running Lucene (inside our application) under OptimizeIt to 
see if I can reduce garbage generation and improve the performance of our 
app. Let me say right away that Lucene is great! :) The more experience 
I have with it, the better I find it performs. However, there are perhaps 
a few places still left in Lucene that could use further 
optimization. In particular, I've been looking at SegmentTermEnum, since 
our app does a lot of what I call "term queries" - queries whose results 
are matching terms rather than matching documents.

There seem to be at least two other threads on the list right now 
discussing similar types of queries, but I'm not too familiar 
with those particular APIs. What we do is instantiate TermEnums and then 
iterate through them in various ways. For example, the following code 
fragment is what I used for testing; it is similar to some of the 
queries that we run. I expect that any kind of "starts-with" 
wildcard query would run into the same issues.

    private static void run2(IndexReader ir) throws Exception {
        for (int i = 0; i < 10; i++) {
            for (char c1 = 'a'; c1 < 'z'; c1++) {
                for (char c2 = 'a'; c2 < 'z'; c2++) {
                    Term starter = new Term(FIELD, "" + c1 + c2);
                    // positions the enum at the first term >= starter
                    TermEnum terms = ir.terms(starter);
                    try {
                        if (terms.term() != null) {  // null when past the last term
                            System.out.println(terms.term());
                        }
                    } finally {
                        terms.close();
                    }
                }
            }
        }
    }

The main source of garbage when running this code appears to be in 
SegmentTermEnum, which insists on creating Term instances (with the 
Strings and char[]s that go into them) for each term that it looks at. 
The typical pattern is that it uses the term index to pick a spot in the 
enum near the term I am requesting, seek()s to it, and 
then retrieves terms sequentially until it reaches the term 
that I need.
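For illustration, that index-then-scan pattern can be sketched roughly as follows. This is a toy model with made-up data structures (plain String arrays standing in for the term dictionary and its sparse index), not the actual SegmentTermEnum code:

```java
import java.util.Arrays;

public class IndexScanSketch {
    /**
     * Locate target in a sorted term list: binary-search a sparse index
     * (one entry every `interval` terms), seek there, then scan forward.
     * Returns the position of the first term >= target.
     */
    static int scanTo(String[] allTerms, String[] indexTerms,
                      int interval, String target) {
        // Find the greatest index entry <= target.
        int i = Arrays.binarySearch(indexTerms, target);
        if (i < 0) i = -i - 2;               // insertion point minus one
        int pos = Math.max(i, 0) * interval; // seek near the target...
        while (pos < allTerms.length && allTerms[pos].compareTo(target) < 0)
            pos++;                           // ...then scan sequentially
        return pos;
    }

    public static void main(String[] args) {
        String[] all = {"aa","ab","ac","ba","bb","bc","ca","cb","cc"};
        String[] idx = {"aa","ba","ca"};     // every 3rd term is indexed
        System.out.println(scanTo(all, idx, 3, "bb")); // prints 4
    }
}
```

In the real code, each step of the sequential scan is where the intermediate Term objects get allocated.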

Now, to the question / proposal part:
1) Since I do not need the intermediate terms, it makes sense to 
have a method that skips to the right term without creating the 
intermediate Term objects. I did a version of this yesterday and 
saw a factor of 2 performance increase and a factor of 2 
garbage reduction. The patch adds the following method to Term.java:
 final int compareTo(String otherField, char[] otherText, int start, int len)
and changes SegmentTermEnum.java to delay creation of the Term object 
until term() is called.
The full diff is attached. Any comments are welcome, especially if I've 
missed something.
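For illustration, a comparison along those lines might look like the sketch below. This is a standalone approximation of the idea, not the attached patch itself; it assumes Term ordering compares the field first and then the text, as Term.compareTo(Term) does:

```java
// Hedged sketch of the proposed allocation-free comparison: compare a
// (field, char[] slice) pair against a target term without building a Term.
final class TermCompare {
    /** Compare (field, text[start..start+len)) to (otherField, otherText). */
    static int compare(String field, char[] text, int start, int len,
                       String otherField, String otherText) {
        int c = field.compareTo(otherField);
        if (c != 0) return c;                // fields differ: done
        int limit = Math.min(len, otherText.length());
        for (int i = 0; i < limit; i++) {    // char-by-char, like String.compareTo
            char a = text[start + i];
            char b = otherText.charAt(i);
            if (a != b) return a - b;
        }
        return len - otherText.length();     // shorter text sorts first
    }

    public static void main(String[] args) {
        char[] buf = "xxabcxx".toCharArray();            // slice [2,5) = "abc"
        System.out.println(compare("body", buf, 2, 3, "body", "abd"));  // negative
        System.out.println(compare("body", buf, 2, 3, "body", "abc")); // prints 0
        System.out.println(compare("title", buf, 2, 3, "body", "abc")); // positive
    }
}
```

With a comparison like this, the scan loop can test each on-disk term against the target directly from the enum's internal buffer.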

2) So far, so good, but perhaps we can do better? I am looking for other 
suggestions around this functionality. For example, the 
above change helps with "starts-with" queries, but does nothing for 
"contains" types of queries. Perhaps a better way is to 
let users supply a TermPredicate that would be 
evaluated against the term data without constructing a Term 
object at all. skipTo(TermPredicate) would then stop at the next 
matching term or at the end of the enum.
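A TermPredicate along those lines might be sketched like this. All the names and signatures here (TermPredicate, the matches() contract) are hypothetical, proposed in this message rather than existing Lucene API:

```java
// Hypothetical sketch: the enum would evaluate the predicate against its raw
// term buffer and only materialize a Term once a match is found.
interface TermPredicate {
    /** Called with the enum's internal buffer; must not retain the array. */
    boolean matches(String field, char[] text, int start, int len);
}

public class TermPredicateDemo {
    // A "contains" predicate that does its matching char-by-char, so no
    // String or Term is allocated for the terms that get skipped.
    // (A real predicate would likely also check the field name.)
    static TermPredicate contains(final String fragment) {
        return new TermPredicate() {
            public boolean matches(String field, char[] text, int start, int len) {
                int flen = fragment.length();
                for (int i = 0; i + flen <= len; i++) {
                    int j = 0;
                    while (j < flen && text[start + i + j] == fragment.charAt(j))
                        j++;
                    if (j == flen) return true;
                }
                return false;
            }
        };
    }

    public static void main(String[] args) {
        TermPredicate p = contains("cen");
        char[] buf = "lucene".toCharArray();
        System.out.println(p.matches("body", buf, 0, 6)); // prints true
        System.out.println(p.matches("body", buf, 0, 3)); // prints false
    }
}
```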

3) I found a piece of code in TermInfosReader.java that uses a field 
SegmentTermEnum.prev to try to optimize seeks. It looks like this code 
was put in after the original SegmentTermEnum was finished. I can't find 
any record of this change in Jakarta's CVS, so probably it was done 
prior to moving to Jakarta. Does anyone remember why this is here? Does 
it actually serve a useful purpose? It seems that the condition this 
code is testing for would not really occur. Perhaps I'm missing 
something. Here's the code fragment that uses the .prev field:

  /** Returns the TermInfo for a Term in the set, or null. */
  final synchronized TermInfo get(Term term) throws IOException {
    if (size == 0) return null;
   
    // optimize sequential access: first try scanning cached enum w/o seeking
    if (enum.term() != null              // term is at or past current
        && ((enum.prev != null && term.compareTo(enum.prev) > 0)
            || term.compareTo(enum.term()) >= 0)) {
        int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1;
        if (indexTerms.length == enumOffset      // but before end of block
            || term.compareTo(indexTerms[enumOffset]) < 0)
                return scanEnum(term);              // no need to seek
    }
   
    // random-access: must seek
    seekEnum(getIndexOffset(term));
    return scanEnum(term);
  }
 
