lucene-dev mailing list archives

From "Doug Cutting (JIRA)" <>
Subject [jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
Date Fri, 03 Mar 2006 21:28:40 GMT

Doug Cutting commented on LUCENE-502:

> Which is true? Or, as it seems likely, TermScorer was optimized for a case that is no
> longer valid (i.e. ConjunctiveScorer).

No, it was optimized for BooleanScorer's *disjunctive* scoring algorithm, which is no longer
used by default, but is faster than BooleanScorer2's disjunctive scoring algorithm.  This
applies to a very common type of query: classic vector-space searches.  So this optimization
may not be leveraged much in the current codebase, but that does not mean that it is no longer
valid.  It may, however, slow other sorts of searches, like your wildcards.  The challenge is not
just to make your application as fast as possible, but to do so
without making others' and future applications slower.

> In short, we should have two TermScorer implementations. One for low documents/term,
> and one for high documents/term.

Yes, I think that would be useful.  Classically, total query processing time is dominated
by common terms, so that's an important case to optimize.  But it seems that, with wildcard
queries over smaller collections, these optimizations become costly.  So two implementations
seem like they would make everyone happy.
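
As a rough illustration of the two-implementation idea, the sketch below picks a variant from a term's document frequency. Everything in it is hypothetical: the class and enum names are invented for this example and are not Lucene API, and the cutoff (one buffer's worth of documents, using the 32-document figure from the issue) is an untested assumption that would need benchmarking.

```java
/** Hypothetical selection policy; names and threshold are assumptions, not Lucene API. */
final class TermScorerSelection {
    enum Kind { BUFFERED, UNBUFFERED }

    /** Buffer size mentioned in the issue: 32 docs and 32 freqs cached per term. */
    static final int BUFFER_SIZE = 32;

    /**
     * Pick a scorer variant from the term's document frequency.
     * A term matching fewer documents than one buffer holds never fills
     * its arrays, so this sketch routes it to the unbuffered variant.
     */
    static Kind choose(int docFreq) {
        if (docFreq < 0) {
            throw new IllegalArgumentException("docFreq must be >= 0");
        }
        return docFreq >= BUFFER_SIZE ? Kind.BUFFERED : Kind.UNBUFFERED;
    }
}
```

Under this policy the many rare terms produced by a wildcard expansion would take the unbuffered path, while common terms in a vector-space query would keep the buffered one.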

> TermScorer caches values unnecessarily
> --------------------------------------
>          Key: LUCENE-502
>          URL:
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
> TermScorer aggressively caches the doc and freq of 32 documents at a time for each term
> scored.  When querying for a lot of terms, this creates a lot of unnecessary garbage.
> The SegmentTermDocs from which it retrieves its information doesn't have any
> optimizations for bulk loading, so the caching is unnecessary.
> In addition, it has a SCORE_CACHE that's of limited benefit.  It's caching the result
> of a sqrt that should be placed in DefaultSimilarity, and if you're only scoring a few documents
> that contain those terms, there's no need to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not cache the docs
> or freqs.  For a lot of queries, that saves 196 bytes/term, the unnecessary disk
> IO, and the extra SQRTs, which add up.
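
The trade-off described in the issue can be sketched with a small self-contained example. This is not the attached patch and not Lucene's TermScorer: `Postings`, `read`, `tf`, and both scoring methods are hypothetical stand-ins, and the score-cache size of 32 is an assumption made for the illustration. The buffered variant allocates two 32-element arrays plus a score table for every term; the unbuffered variant allocates nothing per term and computes the square root on demand.

```java
/** Illustrative sketch only; these classes are hypothetical, not Lucene's TermScorer. */
final class TermScoringSketch {
    /** Stand-in for a postings list: parallel arrays of doc ids and term frequencies. */
    static final class Postings {
        final int[] docs;
        final int[] freqs;
        private int pos = 0;

        Postings(int[] docs, int[] freqs) {
            this.docs = docs;
            this.freqs = freqs;
        }

        /** Bulk read: copy up to outDocs.length postings and return how many were copied. */
        int read(int[] outDocs, int[] outFreqs) {
            int n = Math.min(outDocs.length, docs.length - pos);
            System.arraycopy(docs, pos, outDocs, 0, n);
            System.arraycopy(freqs, pos, outFreqs, 0, n);
            pos += n;
            return n;
        }
    }

    /** Term-frequency factor: the square root the issue refers to. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Buffered variant: reads 32 postings at a time and precomputes scores for small freqs. */
    static float[] scoreBuffered(Postings p, float weight) {
        final int size = 32;                  // buffer size from the issue; cache size assumed equal
        int[] docs = new int[size];           // per-term allocation
        int[] freqs = new int[size];          // per-term allocation
        float[] scoreCache = new float[size]; // per-term allocation
        for (int f = 0; f < size; f++) {
            scoreCache[f] = tf(f) * weight;   // sqrt computed up front, whether or not it is used
        }
        float[] scores = new float[p.docs.length];
        int k = 0;
        int n;
        while ((n = p.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) {
                int f = freqs[i];
                scores[k++] = f < size ? scoreCache[f] : tf(f) * weight;
            }
        }
        return scores;
    }

    /** Unbuffered variant: no per-term buffers, sqrt computed only for postings actually scored. */
    static float[] scoreUnbuffered(Postings p, float weight) {
        float[] scores = new float[p.docs.length];
        for (int i = 0; i < p.docs.length; i++) {
            scores[i] = tf(p.freqs[i]) * weight;
        }
        return scores;
    }
}
```

Both variants produce the same scores; they differ only in what they allocate and precompute per term, which is why the cost shows up when a query expands to many terms that each match few documents.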
