lucene-dev mailing list archives

From "paul.elschot (JIRA)" <>
Subject [jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
Date Fri, 03 Mar 2006 20:55:41 GMT

paul.elschot commented on LUCENE-502:

>> The question is how does the caching help when you have multiple documents. My analysis
>> is that (with a modern VM) it helps you only if the docFreq of a term is 16-31 and you are
>> using a ConjunctiveScorer (i.e. not Wildcard searches).
> The conjunctive scorer does not call score(HitCollector,int). This is only called in
> a few cases anymore. It can help a lot with a single-term query for a very common term, or
> for disjunctive queries involving very common terms, although BooleanScorer2 no longer uses
> it in this case. That's too bad. If all clauses to a query are optional, then the old BooleanScorer
> was faster. But it didn't always return documents in order... So it indeed may be time to
> retire this method.

With BooleanScorer2 it is quite possible to use different versions of DisjunctionScorer:
one for the top level of the query that does not need skipTo(), and one for lower levels
that allows skipTo(). The top level one can be implemented just like the "old" BooleanScorer.

IIRC the methods to implement such different behaviour are already in place (for scoring a
range of documents); this only needs to be implemented for DisjunctionScorer, and the top
level BooleanScorer2 should use it when appropriate.
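
As a rough illustration of the top level variant, the sketch below sums the clause scores
one window of document ids at a time, calling only next() on each clause and never skipTo().
SimpleScorer, ArrayScorer, scoreAll and the window size are hypothetical names made up for
this example; this is not the code of BooleanScorer or BooleanScorer2.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopLevelDisjunctionSketch {

    /** Hypothetical minimal scorer: iterates matching doc ids in increasing order. */
    interface SimpleScorer {
        boolean next();  // advance to the next match; false when exhausted
        int doc();       // current doc id, valid after next() returned true
        float score();   // score of the current doc
    }

    /** Toy scorer over a sorted list of doc ids, each with the same score. */
    static final class ArrayScorer implements SimpleScorer {
        private final int[] docs;
        private final float value;
        private int pos = -1;

        ArrayScorer(float value, int... docs) {
            this.value = value;
            this.docs = docs;
        }

        public boolean next() { return ++pos < docs.length; }
        public int doc() { return docs[pos]; }
        public float score() { return value; }
    }

    /** Sums clause scores per document, one fixed-size window of doc ids at a time. */
    static Map<Integer, Float> scoreAll(List<? extends SimpleScorer> clauses,
                                        int maxDoc, int windowSize) {
        Map<Integer, Float> hits = new LinkedHashMap<>();
        boolean[] more = new boolean[clauses.size()];
        for (int i = 0; i < more.length; i++) {
            more[i] = clauses.get(i).next();
        }
        float[] bucket = new float[windowSize];
        boolean[] seen = new boolean[windowSize];
        for (int start = 0; start < maxDoc; start += windowSize) {
            int end = Math.min(start + windowSize, maxDoc);
            // Each clause dumps all of its matches in [start, end) into the buckets.
            for (int i = 0; i < more.length; i++) {
                SimpleScorer s = clauses.get(i);
                while (more[i] && s.doc() < end) {
                    bucket[s.doc() - start] += s.score();
                    seen[s.doc() - start] = true;
                    more[i] = s.next();
                }
            }
            // Emit the filled buckets and reset them for the next window.
            for (int d = start; d < end; d++) {
                if (seen[d - start]) {
                    hits.put(d, bucket[d - start]);
                    bucket[d - start] = 0f;
                    seen[d - start] = false;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<SimpleScorer> clauses = Arrays.asList(
                new ArrayScorer(1.0f, 1, 3, 9),
                new ArrayScorer(2.0f, 3, 4));
        System.out.println(scoreAll(clauses, 10, 4));
    }
}
```

This sketch emits the buckets of a window in doc id order for simplicity. A lower level
variant would instead have to keep its clauses positioned so that skipTo() can be supported.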

Paul Elschot

> TermScorer caches values unnecessarily
> --------------------------------------
>          Key: LUCENE-502
>          URL:
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
> TermScorer aggressively caches the doc and freq of 32 documents at a time for each term
> scored.  When querying for a lot of terms, this causes a lot of garbage to be created that's
> unnecessary.  The SegmentTermDocs from which it retrieves its information doesn't have any
> optimizations for bulk loading, and it's unnecessary.
> In addition, it has a SCORE_CACHE, that's of limited benefit.  It's caching the result
> of a sqrt that should be placed in DefaultSimilarity, and if you're only scoring a few documents
> that contain those terms, there's no need to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not cache the docs
> or freqs.  In the case of a lot of queries, that saves 196 bytes/term, the unnecessary disk
> IO, and extra SQRTs, which add up.
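
For readers unfamiliar with the score cache discussed in the description, here is a minimal
sketch of the idea: precompute sqrt(freq) * weight for small term frequencies and fall back
to computing it directly for larger ones. The class name, the method names and the cache
size of 32 are assumptions for illustration; this is not the actual TermScorer code.

```java
public class ScoreCacheSketch {

    // Assumed cache size, chosen only for this illustration.
    static final int CACHE_SIZE = 32;

    /** Term frequency factor: the square root the description refers to. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Precomputes tf(freq) * weight for every freq below CACHE_SIZE. */
    static float[] buildCache(float weight) {
        float[] cache = new float[CACHE_SIZE];
        for (int freq = 0; freq < CACHE_SIZE; freq++) {
            cache[freq] = tf(freq) * weight;
        }
        return cache;
    }

    /** Uses the cache for small freqs and computes the score directly otherwise. */
    static float cachedScore(float[] cache, int freq, float weight) {
        return freq < cache.length ? cache[freq] : tf(freq) * weight;
    }

    /** The uncached alternative: one sqrt per scored document. */
    static float directScore(int freq, float weight) {
        return tf(freq) * weight;
    }

    public static void main(String[] args) {
        float weight = 1.7f;
        float[] cache = buildCache(weight);
        System.out.println(cachedScore(cache, 4, weight));
        System.out.println(directScore(4, weight));
    }
}
```

Building the cache costs CACHE_SIZE square roots per term up front, so it only pays off when
a term is scored for more documents than that, which is the trade-off the description questions.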
