lucene-dev mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-687) Performance improvement: Lazy skipping on proximity file
Date Thu, 19 Oct 2006 02:33:27 GMT
I have a working copy using the news groups in the issue, but need to  
split it out into a shorter version, as suggested by some earlier  
threads concerning the issue.  I hope to get it committed this  
week, if not early next week.

-Grant

On Oct 18, 2006, at 4:26 PM, Steven Parkes wrote:

> Any idea when you're going to post a snapshot of your 675 stuff,  
> Grant?
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:gsingers@apache.org]
> Sent: Wednesday, October 18, 2006 1:18 PM
> To: java-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-687) Performance improvement:
> Lazy skipping on proximity file
>
> Can you share your performance test as well as the results?
>
> http://issues.apache.org/jira/browse/LUCENE-675
>
> Thanks,
> Grant
>
> On Oct 18, 2006, at 3:41 PM, Michael Busch (JIRA) wrote:
>
>>     [ http://issues.apache.org/jira/browse/LUCENE-687?page=comments#action_12443343 ]
>>
>> Michael Busch commented on LUCENE-687:
>> --------------------------------------
>>
>> Hi Yonik,
>>
>> thanks for the quick reply! I'm going to do performance tests and
>> will give you some numbers soon.
>>
>>> Performance improvement: Lazy skipping on proximity file
>>> --------------------------------------------------------
>>>
>>>                 Key: LUCENE-687
>>>                 URL: http://issues.apache.org/jira/browse/LUCENE-687
>>>             Project: Lucene - Java
>>>          Issue Type: Improvement
>>>          Components: Index
>>>            Reporter: Michael Busch
>>>            Priority: Minor
>>>         Attachments: lazy_prox_skipping.patch
>>>
>>>
>>> Hello,
>>> I'm proposing a patch here that changes
>>> org.apache.lucene.index.SegmentTermPositions to avoid unnecessary
>>> skips and reads on the proximity stream. Currently a call to
>>> next() or seek(), which moves to a document in the freq file,
>>> also moves the prox pointer to the posting list of that
>>> document. But this is only necessary if actual positions have to
>>> be retrieved for that particular document.
>>> Consider, for example, a phrase query with two terms: the freq
>>> pointer for term 1 has to move to document x to determine
>>> whether the term occurs in that document. But the positions have
>>> to be read *only* if term 2 also matches document x, to figure
>>> out whether term 1 and term 2 appear next to each other in
>>> document x and thus satisfy the query.
>>> A move to the posting list of a document can be quite expensive.
>>> The prox pointer has to skip to the last skip point before that
>>> document, and then the documents between the skip point and the
>>> desired document have to be scanned, which means that the VInts
>>> of all positions of those documents have to be read and decoded.
>>> An improvement is to move the prox pointer lazily to a document
>>> only if nextPosition() is called. This will become even more
>>> important in the future when the size of the proximity file
>>> increases (e.g. by adding payloads to the posting lists).
>>> My patch implements this lazy skipping. All unit tests pass.
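The deferral described above can be illustrated with a small toy model. This is only a sketch and not the code in lazy_prox_skipping.patch: the names `LazyProxSketch`, `ProxStream`, `TermCursor`, `preparePositions()` and `countProxSeeks()` are hypothetical, and it assumes the prox offset of every posting is known directly, whereas a real index only stores skip points and has to scan between them. It walks two sorted doc-id lists as a phrase query would and counts how often the proximity stream is repositioned, with and without lazy skipping.

```java
/** Toy model, not Lucene code: a term cursor that can defer its prox seek. */
final class LazyProxSketch {

    /** Stand-in for the proximity stream; it only counts seek() calls. */
    static final class ProxStream {
        int seeks = 0;
        long pointer = 0;

        void seek(long offset) {
            seeks++;
            pointer = offset;
        }
    }

    /** Cursor over one term's postings (hypothetical and simplified). */
    static final class TermCursor {
        private final int[] docs;            // sorted ids of docs containing the term
        private final ProxStream prox;
        private final boolean lazy;
        private int index = -1;
        private boolean proxPending = false; // true while the prox pointer is stale

        TermCursor(int[] docs, ProxStream prox, boolean lazy) {
            this.docs = docs;
            this.prox = prox;
            this.lazy = lazy;
        }

        /** Moves forward to the next doc with id >= target; false when exhausted. */
        boolean skipTo(int target) {
            do {
                index++;
            } while (index < docs.length && docs[index] < target);
            if (index >= docs.length) {
                return false;
            }
            if (lazy) {
                proxPending = true;      // only remember that prox must move later
            } else {
                prox.seek(proxOffset()); // eager: move prox on every doc change
            }
            return true;
        }

        int doc() {
            return docs[index];
        }

        /** Called only when positions for the current doc are really needed. */
        void preparePositions() {
            if (proxPending) {
                prox.seek(proxOffset());
                proxPending = false;
            }
        }

        // Assumption for this sketch: each posting's prox offset is known directly.
        private long proxOffset() {
            return index * 16L;
        }
    }

    /** Intersects two terms as a phrase query would; returns the prox seek count. */
    static int countProxSeeks(boolean lazy) {
        ProxStream prox = new ProxStream();
        TermCursor t1 = new TermCursor(new int[] {1, 3, 5, 7, 9, 11}, prox, lazy);
        TermCursor t2 = new TermCursor(new int[] {2, 4, 5, 8, 10, 11}, prox, lazy);
        boolean more = t1.skipTo(0) && t2.skipTo(0);
        while (more) {
            if (t1.doc() == t2.doc()) {
                t1.preparePositions();   // positions are needed only on a doc match
                t2.preparePositions();
                more = t1.skipTo(t1.doc() + 1) && t2.skipTo(t2.doc() + 1);
            } else if (t1.doc() < t2.doc()) {
                more = t1.skipTo(t2.doc());
            } else {
                more = t2.skipTo(t1.doc());
            }
        }
        return prox.seeks;
    }

    public static void main(String[] args) {
        System.out.println("eager prox seeks: " + countProxSeeks(false));
        System.out.println("lazy prox seeks:  " + countProxSeeks(true));
    }
}
```

In the lazy variant the proximity stream is touched only for documents that both terms share, which is the effect the patch description aims for.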
>>> I also attach a new unit test that works as follows:
>>> Using a RAMDirectory, an index is created and test docs are added.
>>> Then the index is optimized to make sure it only has a single
>>> segment. This is important, because IndexReader.open() returns an
>>> instance of SegmentReader if there is only one segment in the
>>> index. The proxStream instance of SegmentReader is package
>>> protected, so it is possible to set proxStream to a different
>>> object. I am using a class called SeeksCountingStream that extends
>>> IndexInput in a way that it is able to count the number of
>>> invocations of seek().
>>> Then the testcase searches the index using a PhraseQuery "term1
>>> term2". It is known how many documents match that query and the
>>> testcase can verify that seek() on the proxStream is not called
>>> more often than the number of search hits.
>>> Example:
>>> Number of docs in the index: 500
>>> Number of docs that match the query "term1 term2": 5
>>> Invocations of seek on prox stream (old code): 29
>>> Invocations of seek on prox stream (patched version): 5
>>> - Michael
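The counting approach used by the test can be sketched as a simple decorator. This is a hedged illustration only, not the SeeksCountingStream class attached to the issue: `SeekableInput` is a hypothetical two-method stand-in for the stream type, `ByteArrayInput` exists only so the example runs without Lucene, and `readAt()` is an invented helper that plays the role of the search.

```java
/** Hypothetical stand-in for the small part of a stream API this sketch needs. */
interface SeekableInput {
    void seek(long pos);
    byte readByte();
}

/** In-memory input, present only so the example runs on its own. */
final class ByteArrayInput implements SeekableInput {
    private final byte[] data;
    private int pos = 0;

    ByteArrayInput(byte[] data) {
        this.data = data;
    }

    public void seek(long pos) {
        this.pos = (int) pos;
    }

    public byte readByte() {
        return data[pos++];
    }
}

/** Decorator that forwards every call and counts seek() invocations. */
final class SeekCountingInput implements SeekableInput {
    private final SeekableInput delegate;
    private int seekCount = 0;

    SeekCountingInput(SeekableInput delegate) {
        this.delegate = delegate;
    }

    public void seek(long pos) {
        seekCount++;
        delegate.seek(pos);
    }

    public byte readByte() {
        return delegate.readByte();
    }

    int seekCount() {
        return seekCount;
    }
}

final class SeekCountingDemo {

    /** Invented helper: seeks to each offset once and reads one byte there. */
    static byte[] readAt(SeekableInput in, long[] offsets) {
        byte[] out = new byte[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            in.seek(offsets[i]);
            out[i] = in.readByte();
        }
        return out;
    }

    public static void main(String[] args) {
        SeekCountingInput counting =
                new SeekCountingInput(new ByteArrayInput(new byte[] {10, 20, 30, 40, 50}));
        long[] hitOffsets = {1, 3};
        readAt(counting, hitOffsets);
        // Same shape of check as in the described test: no more seeks than hits.
        if (counting.seekCount() > hitOffsets.length) {
            throw new AssertionError("more seeks than hits");
        }
        System.out.println("seeks: " + counting.seekCount());
    }
}
```

Wrapping the stream rather than changing the reader keeps the production code untouched; the test only needs a way to swap the wrapped stream in, which is what the package-protected proxStream field allows in the setup described above.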
>>
>> -- 
>> This message is automatically generated by JIRA.
>> -
>> If you think it was sent incorrectly contact one of the
>> administrators: http://issues.apache.org/jira/secure/Administrators.jspa
>> -
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>
> --------------------------
> Grant Ingersoll
> Sr. Software Engineer
> Center for Natural Language Processing
> Syracuse University
> 335 Hinds Hall
> Syracuse, NY 13244
> http://www.cnlp.org
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

