lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-687) Performance improvement: Lazy skipping on proximity file
Date Mon, 30 Oct 2006 18:36:19 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-687?page=comments#action_12445683 ] 
            
Michael Busch commented on LUCENE-687:
--------------------------------------

I made some performance experiments with phrase searches. For every query I measured the search
time as well as the number of invocations of IndexInput.readVint(). While the search time
is very specific for my test environment, the number of Vints should tell us how much I/O
we safe independent from the actual hardware.

Test environment:
2x Intel Xeon CPU 2.80GHz
4GB RAM
280GB SCSI HD
Windows Server 2003


My index contains about 460,000 documents from different websites like www.cnn.com, www.joelonsoftware.com,...,
also German sites like www.heise.de.
The search times below are average values over 10 searches.

Tests:
- Query: "the der" 
  This should be a hard query, because "the" appears in almost every English document whereas
"der" in almost every German document.
  Number of hits: 15
  Execution time in ms (old, new, improvement): 651.3, 472.1, 28%
  # if VInts (old, new, improvement): 25786474, 16007634, 38%
  
- Query: "joel on software"
  "joel" only appears in a small subset of documents, "on" in all English documents, and "software"
can appear in German and English documents.
  Number of hits: 1613
  Execution time in ms (old, new, improvement): 33.0, 20.1, 39%
  # if VInts (old, new, improvement): 950243, 256481, 73%
  
- Query: "much much more"
  This is an example where we don't save so much I/O, because it is likely that all terms
appear in the most English documents.
  Number of hits: 76
  Execution time in ms (old, new, improvement): 228.2, 215.5, 6%
  # if VInts (old, new, improvement): 3580288, 3067693, 14%
  
I think these numbers look pretty good. In the future, when we hopefully have proximity scoring,
every search would benefit from this improvement. (I actually have a running version of proximity
scoring in my local code and verified this). 

Did you have some time to look into the patch yet, Yonik? Let me know if you need more information,
please.

> Performance improvement: Lazy skipping on proximity file
> --------------------------------------------------------
>
>                 Key: LUCENE-687
>                 URL: http://issues.apache.org/jira/browse/LUCENE-687
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>         Attachments: lazy_prox_skipping.patch
>
>
> Hello,
> I'm proposing a patch here that changes org.apache.lucene.index.SegmentTermPositions
to avoid unnecessary skips and reads on the proximity stream. Currently a call of next() or
seek(), which causes a movement to a document in the freq file also moves the prox pointer
to the posting list of that document.  But this is only necessary if actual positions have
to be retrieved for that particular document. 
> Consider for example a phrase query with two terms: the freq pointer for term 1 has to
move to document x to answer the question if the term occurs in that document. But *only*
if term 2 also matches document x, the positions have to be read to figure out if term 1 and
term 2 appear next to each other in document x and thus satisfy the query. 
> A move to the posting list of a document can be quite expensive. It has to be skipped
to the last skip point before that document and then the documents between the skip point
and the desired document have to be scanned, which means that the VInts of all positions of
those documents have to be read and decoded. 
> An improvement is to move the prox pointer lazily to a document only if nextPosition()
is called. This will become even more important in the future when the size of the proximity
file increases (e. g. by adding payloads to the posting lists).
> My patch implements this lazy skipping. All unit tests pass. 
> I also attach a new unit test that works as follows:
> Using a RamDirectory an index is created and test docs are added. Then the index is optimized
to make sure it only has a single segment. This is important, because IndexReader.open() returns
an instance of SegmentReader if there is only one segment in the index. The proxStream instance
of SegmentReader is package protected, so it is possible to set proxStream to a different
object. I am using a class called SeeksCountingStream that extends IndexInput in a way that
it is able to count the number of invocations of seek(). 
> Then the testcase searches the index using a PhraseQuery "term1 term2". It is known how
many documents match that query and the testcase can verify that seek() on the proxStream
is not called more often than number of search hits.
> Example:
> Number of docs in the index: 500
> Number of docs that match the query "term1 term2": 5
> Invocations of seek on prox stream (old code): 29
> Invocations of seek on prox stream (patched version): 5
> - Michael

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Mime
View raw message