lucene-dev mailing list archives

From "Steven Tamm (JIRA)" <>
Subject [jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
Date Fri, 03 Mar 2006 18:58:39 GMT

Steven Tamm commented on LUCENE-502:

The main point is this:  When you are using TermScorer to score one document, it is doing
a lot of extra work.  It's reading 31 extra documents from the disk and calculating the weight
factors for 31 documents.   The question is how does the caching help when you have multiple
documents.  My analysis is that (with a modern VM) it helps you only if the docFreq of a term
is 16-31 and you are using a ConjunctionScorer (i.e., not wildcard searches).  I would imagine
this is a use case that is not uncommon.  Anyone using Wildcard searches will have *immediate*
benefit from installing this patch.
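
To make the "31 extra documents" figure concrete, here is a rough sketch of the arithmetic.  It is a toy model, not Lucene code: the class and method names are made up for illustration, and it only assumes that a buffered scorer refills 32 postings at a time while an unbuffered one decodes a posting per call to next().

```java
final class BufferedReadSketch {
    // Batch size described in the message; everything else here is hypothetical.
    static final int BUFFER_SIZE = 32;

    /** Postings decoded when the scorer refills a 32-entry buffer as needed. */
    static int postingsDecodedBuffered(int docFreq, int docsScored) {
        if (docsScored <= 0) {
            return 0;
        }
        int refills = (docsScored + BUFFER_SIZE - 1) / BUFFER_SIZE; // ceiling division
        return Math.min(docFreq, refills * BUFFER_SIZE);
    }

    /** Postings decoded when the scorer advances one posting at a time. */
    static int postingsDecodedUnbuffered(int docFreq, int docsScored) {
        return Math.min(docFreq, Math.max(docsScored, 0));
    }

    public static void main(String[] args) {
        int docFreq = 1000; // a term that occurs in many documents
        int buffered = postingsDecodedBuffered(docFreq, 1);
        int unbuffered = postingsDecodedUnbuffered(docFreq, 1);
        System.out.println("extra postings decoded: " + (buffered - unbuffered));
    }
}
```

In this model, scoring a single document for a frequent term decodes 32 postings with the buffer and 1 without it.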

So I'm going to analyze this from the "amount of work to do" perspective.  If you are calling there is no real difference.[],
float[]) is no different from calling 32 times.  The change in the
patch switches to always calling next on the underlying SegmentTermDocs.
 The only cost I'm removing is the caching and I'm not adding any new ones.  Therefore there's
no change, with the exception of adding the cache for use in skipTo().

TermScorer.skipTo():  The only case where my patch is worse is if the frequency of the term
is greater than the skip interval (i.e., >= 16 documents per term).  In this case, if you
are retrieving more than 16 documents (but fewer than 32), you can avoid accessing the skipStream
entirely.  If you are retrieving more than 32 documents, then you need to access the skipStream
anyway, and since both of the underlying IndexInputs are cached, repositioning the freqStream
will be only pointer manipulation.
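
One way to picture the skipTo() trade-off: a scorer that holds a buffer of doc ids can answer a skipTo by scanning that buffer, and only has to seek the underlying postings when the target lies beyond it.  The sketch below is hypothetical (the names and the in-memory array are assumptions, not the patch or Lucene's implementation).

```java
final class SkipToSketch {
    /**
     * Scans buffered doc ids from index `from` (inclusive) to `limit` (exclusive)
     * and returns the index of the first doc id >= target, or -1 if the buffer is
     * exhausted and the caller would have to seek the underlying postings.
     */
    static int scanBuffer(int[] bufferedDocs, int from, int limit, int target) {
        for (int i = from; i < limit; i++) {
            if (bufferedDocs[i] >= target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] bufferedDocs = {3, 8, 15, 21, 40};
        // Target inside the buffer: answered without any seek.
        System.out.println(scanBuffer(bufferedDocs, 0, bufferedDocs.length, 16));
        // Target past the buffer: -1 signals that a seek is needed.
        System.out.println(scanBuffer(bufferedDocs, 0, bufferedDocs.length, 99));
    }
}
```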

"In some cases JVM's may have evolved so that some of them are no longer required."  I can
imagine that the scoreCache made a lot of sense in JDK 1.1 when the cost of Math.sqrt would
be high.  However, if the TermScorer is only going to be used for a single document, this
is obviously wrong.   Like I said before, caching inside DefaultSimilarity
would end up inlined by the HotSpot compiler, but Math.sqrt is inlined into a processor trap,
so it's not a big deal.
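
For anyone who wants to measure this themselves, the two scoring strategies can be sketched roughly as follows.  This is an illustration under assumptions, not the TermScorer source: the class name, the cache size of 32, and the tf(freq) * weight formula are stand-ins for the sqrt-based score cache discussed above.

```java
final class ScoreCacheSketch {
    // Assumed cache size, chosen for illustration only.
    static final int SCORE_CACHE_SIZE = 32;

    /** Term-frequency factor computed as a square root. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Precomputes tf(f) * weight for small frequencies: 32 sqrt calls up front. */
    static float[] buildScoreCache(float weight) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int f = 0; f < SCORE_CACHE_SIZE; f++) {
            cache[f] = tf(f) * weight;
        }
        return cache;
    }

    /** Looks up the cached score, falling back to a direct computation for large freqs. */
    static float scoreCached(float[] cache, int freq, float weight) {
        return freq < cache.length ? cache[freq] : tf(freq) * weight;
    }

    /** Computes the score on demand: one sqrt per scored document. */
    static float scoreDirect(int freq, float weight) {
        return tf(freq) * weight;
    }

    public static void main(String[] args) {
        float weight = 1.5f;
        float[] cache = buildScoreCache(weight);
        // Both paths give the same value; they differ only in when the sqrt is paid for.
        System.out.println(scoreCached(cache, 4, weight) == scoreDirect(4, weight));
    }
}
```

If a scorer only ever scores one document, the cached variant has paid for 32 square roots to save one.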

I want other people to test this and tell me any problems with it.  Whether or not you accept
the patches is less important to me than providing them to other people that have similar
performance problems.  Perhaps I should have created a parallel structure to TermScorer that
you can use when you have a low hit/term ratio. 

> TermScorer caches values unnecessarily
> --------------------------------------
>          Key: LUCENE-502
>          URL:
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
> TermScorer aggressively caches the doc and freq of 32 documents at a time for each term
> scored.  When querying for a lot of terms, this causes a lot of garbage to be created that's
> unnecessary.  The SegmentTermDocs from which it retrieves its information doesn't have any
> optimizations for bulk loading, and it's unnecessary.
> In addition, it has a SCORE_CACHE, that's of limited benefit.  It's caching the result
> of a sqrt that should be placed in DefaultSimilarity, and if you're only scoring a few documents
> that contain those terms, there's no need to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not cache the docs
> or freqs.  In the case of a lot of queries, that saves 196 bytes/term, the unnecessary disk
> IO, and extra SQRTs which adds up.
