lucene-dev mailing list archives

From Dmitry Serebrennikov <>
Subject Re: Optimizing SegmentTermEnum (and friends)
Date Tue, 25 Feb 2003 23:29:39 GMT
Thanks for your reply, Doug. See below.

Doug Cutting wrote:

> Dmitry Serebrennikov wrote:
>> 1) Since I do not need the intermediate terms, it makes sense to try 
>> to have a method that skips to the right term without creating the 
>> intermediate Term objects. I did a version of this yesterday 
>> and ended up seeing a factor of 2 performance increase and a factor 
>> of 2 garbage reduction. The patch adds the following method to 
>> final int compareTo(String otherField, char[] otherText, int start, 
>> int len)
>> And changes to delay creation of Term object 
>> until call to term().
>> Full diff is attached. Any comments are welcome, especially if I've 
>> missed something.
> Looks reasonable to me.  Does it still pass all of the unit tests? 

I have not had a chance to run them yet. I will report the results once I do.
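
For readers without the attached diff, here is a minimal sketch of the idea: compare against a slice of a char buffer without allocating, and build the String only on demand. The class name TermBuffer, its fields, and the text() accessor are hypothetical and are not taken from the patch. Only the compareTo parameter list follows the signature quoted above, and the ordering (field first, then text by char value) is an assumption.

```java
/** Hypothetical buffer-backed term; illustrative only, not the class from the patch. */
final class TermBuffer {
    private final String field;
    private final char[] text;   // reusable buffer, assumed to be filled by the enum
    private final int length;    // number of valid chars in text
    private String cached;       // built only when text() is called

    TermBuffer(String field, char[] text, int length) {
        this.field = field;
        this.text = text;
        this.length = length;
    }

    /** Compares this term with (otherField, otherText[start..start+len)) without allocating. */
    int compareTo(String otherField, char[] otherText, int start, int len) {
        int c = field.compareTo(otherField);      // assumed order: field name first
        if (c != 0) return c;
        int n = Math.min(length, len);
        for (int i = 0; i < n; i++) {             // then text, char by char
            int d = text[i] - otherText[start + i];
            if (d != 0) return d;
        }
        return length - len;                      // shorter text sorts first
    }

    /** Materializes the text as a String on first use only. */
    String text() {
        if (cached == null) cached = new String(text, 0, length);
        return cached;
    }
}
```

A scan that only calls compareTo on each intermediate term never allocates. The String is created only for the term the caller actually asks for.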

>>  /** Returns the TermInfo for a Term in the set, or null. */
>>  final synchronized TermInfo get(Term term) throws IOException {
>>    if (size == 0) return null;
>>    // optimize sequential access: first try scanning cached enum w/o seeking
>>    if (enum.term() != null              // term is at or past current
>>        && ((enum.prev != null && term.compareTo(enum.prev) > 0)
>>            || term.compareTo(enum.term()) >= 0)) {
>>      int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1;
>>      if (indexTerms.length == enumOffset      // but before end of block
>>          || term.compareTo(indexTerms[enumOffset]) < 0)
>>        return scanEnum(term);              // no need to seek
>>    }
>>    // random-access: must seek
>>    seekEnum(getIndexOffset(term));
>>    return scanEnum(term);
>>  }
> If you put a print statement in this and run the unit tests you'll see 
> that this optimization fires a lot.  If, e.g., one expands a 
> wildcarded string into a bunch of terms, which are near one another in 
> the enum, then subsequently asks for the frequency of each term (to 
> weight it in a query), and then, in a third pass, asks for its 
> TermDocs, then each of these latter passes benefits from this 
> optimization.  So let's not lose it.

I know that the optimization as a whole is important, but I was curious 
to know how important the use of the .prev variable is here. In order to 
maintain this variable, SegmentTermEnum is forced to create Term 
objects that could otherwise be avoided.
If I read this code correctly, the optimization kicks in when
    enum has a current term &&
       (( enum remembers previous term && that term is less than the target term ) ||
           the current term is less than or equal to the target term )
The only time the value of the .prev variable is significant is when the 
enum has a current term but that term is greater than the target. If at 
the same time the enum also remembers the previous term and that term is 
less than the target, the optimization is enabled.

Oh, I see, this is important when the target term is not in the enum... 
There's got to be a better way to implement this that does not require 
copying the buffer in the SegmentTermEnum.
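
To make the role of .prev concrete, here is a small sketch of just the first condition in get(), with plain Strings standing in for Term objects. The class and method names are hypothetical, and the second check against indexTerms (the "before end of block" test) is deliberately left out.

```java
/** Hypothetical helper mirroring only the first condition in get(); Strings stand in for Terms. */
final class SequentialScanCheck {
    /**
     * True when the target lies at or after the enum's current term, or lies
     * strictly after the remembered previous term. prev and current may be null.
     */
    static boolean canScanForward(String prev, String current, String target) {
        return current != null
            && ((prev != null && target.compareTo(prev) > 0)
                || target.compareTo(current) >= 0);
    }
}
```

With prev = "apple" and current = "cherry", a target of "banana" (not in the enum) passes only because of the prev clause. With prev = null the same target fails the check and a seek is needed, even though the enum is already positioned just past where "banana" would be.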


> Doug
