lucene-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-502) TermScorer caches values unnecessarily
Date Fri, 03 Mar 2006 19:52:39 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-502?page=comments#action_12368782 ] 

Doug Cutting commented on LUCENE-502:
-------------------------------------

> The question is how does the caching help when you have multiple documents.  My analysis
> is that (with a modern VM) it helps you only if the docFreq of a term is 16-31 and you are
> using a ConjunctiveScorer (i.e. not Wildcard searches).

The conjunctive scorer does not call score(HitCollector,int).  That method is now called in
only a few cases.  It can help a lot with a single-term query for a very common term, or for
disjunctive queries involving very common terms, although BooleanScorer2 no longer uses it
in this case.  That's too bad.  If all clauses of a query are optional, then the old BooleanScorer
was faster.  But it didn't always return documents in order...  So it may indeed be time to
retire this method.

> SegmentTermDocs.read(int[], int[]) is no different from calling SegmentTermDocs.next()
> 32 times.

If that were the case, then the TermDocs.read(int[], int[]) method would never have been added!
Benchmarking showed this to be much faster.  There's also optimized C++ code that implements
this method in src/gcj.  In C++, with a memory-mapped index, the i/o completely inlines.
When I last benchmarked this in GCJ, it was twice as fast as anything HotSpot could do.

But without score(HitCollector,int), TermDocs.read(int[], int[]) will never be called.  Sigh.
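
To make the two access patterns concrete, here is a minimal sketch of one-at-a-time iteration next to a bulk read into a 32-entry buffer.  PostingsSketch, ArrayPostings and BulkReadSketch are hypothetical stand-ins written for this illustration, not Lucene classes; only the shape of next() and read(int[], int[]) is taken from the discussion above.  The sketch shows the call pattern only and says nothing about relative speed, which depends on how read() is implemented against the index.

```java
/** Hypothetical stand-in for a postings iterator; not Lucene's TermDocs. */
interface PostingsSketch {
    boolean next();                     // advance one posting; false when exhausted
    int doc();                          // current document number
    int freq();                         // term frequency in the current document
    int read(int[] docs, int[] freqs);  // fill both arrays; returns count, 0 when exhausted
}

/** In-memory postings list, used only to make the sketch runnable. */
final class ArrayPostings implements PostingsSketch {
    private final int[] docs;
    private final int[] freqs;
    private int nextIdx = 0;  // index of the next unread posting
    private int cur = -1;     // index of the current posting for doc()/freq()

    ArrayPostings(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    public boolean next() {
        if (nextIdx >= docs.length) return false;
        cur = nextIdx++;
        return true;
    }

    public int doc() { return docs[cur]; }

    public int freq() { return freqs[cur]; }

    public int read(int[] d, int[] f) {
        // Copy as many postings as fit in the caller's buffers.
        int n = Math.min(d.length, docs.length - nextIdx);
        System.arraycopy(docs, nextIdx, d, 0, n);
        System.arraycopy(freqs, nextIdx, f, 0, n);
        nextIdx += n;
        return n;
    }
}

final class BulkReadSketch {
    /** One call to next() and one to freq() per posting. */
    static long sumOneAtATime(PostingsSketch p) {
        long total = 0;
        while (p.next()) total += p.freq();
        return total;
    }

    /** One call to read() per bufferSize postings, then a tight loop over the arrays. */
    static long sumBulk(PostingsSketch p, int bufferSize) {
        int[] docs = new int[bufferSize];
        int[] freqs = new int[bufferSize];
        long total = 0;
        int n;
        while ((n = p.read(docs, freqs)) > 0) {
            for (int i = 0; i < n; i++) total += freqs[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] docs = {2, 5, 9, 14};
        int[] freqs = {1, 3, 2, 4};
        System.out.println(sumOneAtATime(new ArrayPostings(docs, freqs)));
        System.out.println(sumBulk(new ArrayPostings(docs, freqs), 32));
    }
}
```

Both loops visit the same postings and produce the same total; the bulk form simply trades a per-posting method call for two reusable int arrays per scorer.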

As for the scoreCache, this is certainly useful for terms that occur in thousands of documents,
and useless for terms that occur only once.  Perhaps we should have two TermScorer implementations,
one for common terms and one for rare terms, and have TermWeight select which to use.
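
A rough sketch of that idea, under stated assumptions: ScoreCacheSketch, CachedScorer, DirectScorer and select() are hypothetical names invented for this illustration, and the docFreq threshold is a made-up parameter that would have to come from benchmarking.  The tf() here is just the square root mentioned in the issue description, and the cache size of 32 matches the figure quoted in this thread.

```java
/** Hypothetical sketch of choosing a scorer by term frequency of occurrence; not Lucene code. */
final class ScoreCacheSketch {
    static final int CACHE_SIZE = 32;  // assumed cache size, matching the 32 quoted above

    interface FreqScorer {
        float score(int freq);
    }

    /** Square-root term-frequency factor, as described in the issue. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Precomputes tf(freq) * weight for small freqs; pays off when many documents are scored. */
    static final class CachedScorer implements FreqScorer {
        private final float weight;
        private final float[] cache = new float[CACHE_SIZE];

        CachedScorer(float weight) {
            this.weight = weight;
            for (int i = 0; i < CACHE_SIZE; i++) cache[i] = tf(i) * weight;
        }

        public float score(int freq) {
            // Fall back to direct computation for freqs outside the cache.
            return freq < CACHE_SIZE ? cache[freq] : tf(freq) * weight;
        }
    }

    /** Computes each score on demand; avoids the up-front sqrt calls and the array allocation. */
    static final class DirectScorer implements FreqScorer {
        private final float weight;

        DirectScorer(float weight) {
            this.weight = weight;
        }

        public float score(int freq) {
            return tf(freq) * weight;
        }
    }

    /** Picks an implementation from the term's docFreq; the threshold is a hypothetical knob. */
    static FreqScorer select(int docFreq, float weight, int threshold) {
        return docFreq >= threshold ? new CachedScorer(weight) : new DirectScorer(weight);
    }
}
```

Both implementations return the same scores, so the choice only affects setup cost and allocation per term, which is the trade-off under discussion.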

> TermScorer caches values unnecessarily
> --------------------------------------
>
>          Key: LUCENE-502
>          URL: http://issues.apache.org/jira/browse/LUCENE-502
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for each term
> scored.  When querying for a lot of terms, this creates a lot of unnecessary garbage.
> The SegmentTermDocs from which it retrieves its information doesn't have any
> optimizations for bulk loading, so the bulk caching is unnecessary.
> In addition, it has a SCORE_CACHE that's of limited benefit.  It's caching the result
> of a sqrt that should be placed in DefaultSimilarity, and if you're only scoring a few documents
> that contain those terms, there's no need to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not cache the docs
> or freqs.  In the case of a lot of queries, that saves 196 bytes/term, the unnecessary disk
> IO, and extra SQRTs, which add up.

-- 
This message is automatically generated by JIRA.