lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steven Parkes" <>
Subject RE: [jira] Commented: (LUCENE-687) Performance improvement: Lazy skipping on proximity file
Date Wed, 18 Oct 2006 20:26:03 GMT
Any idea when you're going to post a snapshot of your 675 stuff, Grant?

-----Original Message-----
From: Grant Ingersoll [] 
Sent: Wednesday, October 18, 2006 1:18 PM
Subject: Re: [jira] Commented: (LUCENE-687) Performance improvement:
Lazy skipping on proximity file

Can you share your performance test as well as the results?


On Oct 18, 2006, at 3:41 PM, Michael Busch (JIRA) wrote:

>     [ 
> page=comments#action_12443343 ]
> Michael Busch commented on LUCENE-687:
> --------------------------------------
> Hi Yonik,
> thanks for the quick reply! I'm going to do performance tests and  
> will give you some numbers soon.
>> Performance improvement: Lazy skipping on proximity file
>> --------------------------------------------------------
>>                 Key: LUCENE-687
>>                 URL:
>>             Project: Lucene - Java
>>          Issue Type: Improvement
>>          Components: Index
>>            Reporter: Michael Busch
>>            Priority: Minor
>>         Attachments: lazy_prox_skipping.patch
>> Hello,
>> I'm proposing a patch here that changes  
>> org.apache.lucene.index.SegmentTermPositions to avoid unnecessary  
>> skips and reads on the proximity stream. Currently a call of next 
>> () or seek(), which causes a movement to a document in the freq  
>> file also moves the prox pointer to the posting list of that  
>> document.  But this is only necessary if actual positions have to  
>> be retrieved for that particular document.
>> Consider for example a phrase query with two terms: the freq  
>> pointer for term 1 has to move to document x to answer the  
>> question if the term occurs in that document. But *only* if term 2  
>> also matches document x, the positions have to be read to figure  
>> out if term 1 and term 2 appear next to each other in document x  
>> and thus satisfy the query.
>> A move to the posting list of a document can be quite expensive.  
>> It has to be skipped to the last skip point before that document  
>> and then the documents between the skip point and the desired  
>> document have to be scanned, which means that the VInts of all  
>> positions of those documents have to be read and decoded.
>> An improvement is to move the prox pointer lazily to a document  
>> only if nextPosition() is called. This will become even more  
>> important in the future when the size of the proximity file  
>> increases (e. g. by adding payloads to the posting lists).
>> My patch implements this lazy skipping. All unit tests pass.
>> I also attach a new unit test that works as follows:
>> Using a RamDirectory an index is created and test docs are added.  
>> Then the index is optimized to make sure it only has a single  
>> segment. This is important, because returns an  
>> instance of SegmentReader if there is only one segment in the  
>> index. The proxStream instance of SegmentReader is package  
>> protected, so it is possible to set proxStream to a different  
>> object. I am using a class called SeeksCountingStream that extends  
>> IndexInput in a way that it is able to count the number of  
>> invocations of seek().
>> Then the testcase searches the index using a PhraseQuery "term1  
>> term2". It is known how many documents match that query and the  
>> testcase can verify that seek() on the proxStream is not called  
>> more often than number of search hits.
>> Example:
>> Number of docs in the index: 500
>> Number of docs that match the query "term1 term2": 5
>> Invocations of seek on prox stream (old code): 29
>> Invocations of seek on prox stream (patched version): 5
>> - Michael
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the  
> administrators: 
> Administrators.jspa
> -
> For more information on JIRA, see: 
> software/jira
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244

To unsubscribe, e-mail:
For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message