lucene-dev mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: [jira] Commented: (LUCENE-687) Performance improvement: Lazy skipping on proximity file
Date Wed, 18 Oct 2006 20:18:20 GMT
Can you share your performance test as well as the results?

http://issues.apache.org/jira/browse/LUCENE-675

Thanks,
Grant

On Oct 18, 2006, at 3:41 PM, Michael Busch (JIRA) wrote:

>     [ http://issues.apache.org/jira/browse/LUCENE-687?page=comments#action_12443343 ]
>
> Michael Busch commented on LUCENE-687:
> --------------------------------------
>
> Hi Yonik,
>
> thanks for the quick reply! I'm going to do performance tests and  
> will give you some numbers soon.
>
>> Performance improvement: Lazy skipping on proximity file
>> --------------------------------------------------------
>>
>>                 Key: LUCENE-687
>>                 URL: http://issues.apache.org/jira/browse/LUCENE-687
>>             Project: Lucene - Java
>>          Issue Type: Improvement
>>          Components: Index
>>            Reporter: Michael Busch
>>            Priority: Minor
>>         Attachments: lazy_prox_skipping.patch
>>
>>
>> Hello,
>> I'm proposing a patch here that changes
>> org.apache.lucene.index.SegmentTermPositions to avoid unnecessary
>> skips and reads on the proximity stream. Currently, a call to next()
>> or seek() that moves to a document in the freq file also moves the
>> prox pointer to the posting list of that document. But this is only
>> necessary if actual positions have to be retrieved for that
>> particular document.
>> Consider for example a phrase query with two terms: the freq
>> pointer for term 1 has to move to document x to determine whether
>> the term occurs in that document. But the positions have to be read
>> *only* if term 2 also matches document x, to figure out whether
>> term 1 and term 2 appear next to each other in document x and thus
>> satisfy the query.
>> A move to the posting list of a document can be quite expensive.
>> The prox pointer has to skip to the last skip point before that
>> document, and then the documents between the skip point and the
>> desired document have to be scanned, which means that the VInts of
>> all positions of those documents have to be read and decoded.
>> An improvement is to move the prox pointer lazily to a document  
>> only if nextPosition() is called. This will become even more  
>> important in the future when the size of the proximity file  
>> increases (e.g. by adding payloads to the posting lists).
>> My patch implements this lazy skipping. All unit tests pass.
>> I also attach a new unit test that works as follows:
>> Using a RAMDirectory, an index is created and test docs are added.
>> Then the index is optimized to make sure it only has a single
>> segment. This is important, because IndexReader.open() returns an
>> instance of SegmentReader if there is only one segment in the
>> index. The proxStream field of SegmentReader is package
>> protected, so the test can set proxStream to a different
>> object. I am using a class called SeeksCountingStream that extends
>> IndexInput so that it can count the number of invocations of
>> seek().
>> Then the testcase searches the index using a PhraseQuery "term1
>> term2". It is known how many documents match that query and the
>> testcase can verify that seek() on the proxStream is not called
>> more often than the number of search hits.
>> Example:
>> Number of docs in the index: 500
>> Number of docs that match the query "term1 term2": 5
>> Invocations of seek on prox stream (old code): 29
>> Invocations of seek on prox stream (patched version): 5
>> - Michael
>
> -- 
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the
> administrators: http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
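
For readers of the archive, the idea in the quoted issue (seek the prox
stream only when positions are actually requested, and count seek() calls
to verify it) can be illustrated with a small self-contained toy model.
This is only a sketch and not the attached patch: it does not use Lucene,
and the names LazySeekSketch, CountingProx, TermPositions and proxStart are
hypothetical. It also assumes, as a simplification, that a prox offset is
known for every document, whereas real skip data only exists at skip points.

```java
import java.util.Arrays;

/** Toy model of eager vs. lazy prox seeking; an illustration, not Lucene code. */
class LazySeekSketch {

    /** Hypothetical stand-in for the prox stream that counts seek() calls. */
    static final class CountingProx {
        private final int[] data;
        private int pos = 0;
        int seeks = 0;

        CountingProx(int[] data) { this.data = data; }

        void seek(int offset) { pos = offset; seeks++; }

        int read() { return data[pos++]; }
    }

    /** Postings of one term: sorted doc ids plus the term's positions in each doc. */
    static final class TermPositions {
        private final int[] docs;
        private final int[] freqs;
        private final int[] proxStart; // assumption: a prox offset is known for every doc
        private final boolean lazy;
        final CountingProx prox;
        private int idx = -1;
        private int pendingSeek = -1;  // prox offset still owed to the stream, or -1

        TermPositions(int[] docs, int[][] positions, boolean lazy) {
            this.docs = docs;
            this.lazy = lazy;
            freqs = new int[docs.length];
            proxStart = new int[docs.length];
            int total = 0;
            for (int[] p : positions) total += p.length;
            int[] flat = new int[total];
            int off = 0;
            for (int i = 0; i < docs.length; i++) {
                proxStart[i] = off;
                freqs[i] = positions[i].length;
                for (int p : positions[i]) flat[off++] = p;
            }
            prox = new CountingProx(flat);
        }

        /** Advances to the first later doc whose id is >= target. */
        boolean skipTo(int target) {
            do { idx++; } while (idx < docs.length && docs[idx] < target);
            if (idx >= docs.length) return false;
            if (lazy) pendingSeek = proxStart[idx]; // remember the offset, do not seek yet
            else prox.seek(proxStart[idx]);         // eager: seek on every doc movement
            return true;
        }

        int doc() { return docs[idx]; }

        /** Reads all positions of the current doc, paying a pending seek first. */
        int[] positions() {
            if (pendingSeek >= 0) {
                prox.seek(pendingSeek);
                pendingSeek = -1;
            }
            int[] out = new int[freqs[idx]];
            for (int i = 0; i < out.length; i++) out[i] = prox.read();
            return out;
        }
    }

    /** True if some position in second directly follows a position in first. */
    static boolean adjacent(int[] first, int[] second) {
        for (int a : first) for (int b : second) if (b == a + 1) return true;
        return false;
    }

    /** Counts docs where the two terms occur next to each other. */
    static int countPhraseMatches(TermPositions t1, TermPositions t2) {
        int matches = 0;
        boolean more = t1.skipTo(0) && t2.skipTo(0);
        while (more) {
            if (t1.doc() < t2.doc()) more = t1.skipTo(t2.doc());
            else if (t2.doc() < t1.doc()) more = t2.skipTo(t1.doc());
            else {
                // Only now are positions needed, so only now does lazy mode seek.
                if (adjacent(t1.positions(), t2.positions())) matches++;
                more = t1.skipTo(t1.doc() + 1) && t2.skipTo(t2.doc() + 1);
            }
        }
        return matches;
    }

    /** Returns {phrase matches, seek() calls on the prox stream of term 1}. */
    static int[] run(boolean lazy) {
        int[] docs1 = new int[20];
        int[][] pos1 = new int[20][];
        for (int d = 0; d < 20; d++) { docs1[d] = d; pos1[d] = new int[] {0}; }
        int[] docs2 = {4, 9, 14, 19};
        int[][] pos2 = {{1}, {1}, {1}, {1}};
        TermPositions t1 = new TermPositions(docs1, pos1, lazy);
        TermPositions t2 = new TermPositions(docs2, pos2, lazy);
        int matches = countPhraseMatches(t1, t2);
        return new int[] {matches, t1.prox.seeks};
    }

    public static void main(String[] args) {
        System.out.println("eager [matches, seeks]: " + Arrays.toString(run(false)));
        System.out.println("lazy  [matches, seeks]: " + Arrays.toString(run(true)));
    }
}
```

In this toy data set both modes find the same matches, but the lazy mode
seeks the prox stream of term 1 only once per matching document, which is
the same property the unit test described in the issue checks.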

--------------------------
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org





