lucene-dev mailing list archives

From "Steven Tamm (JIRA)" <>
Subject [jira] Updated: (LUCENE-502) TermScorer caches values unnecessarily
Date Wed, 01 Mar 2006 05:34:41 GMT

Steven Tamm updated LUCENE-502:

    Attachment: TermScorer.patch

Here's the patch

Sorry about my lack of proofreading; I saved right as I was leaving work.

The main point is that the look-ahead caching done by TermScorer is unnecessary.  It only
helps if you score documents out of order within a small locality (e.g. query doc 0, then 30,
then 10, then 3, etc.).  Nearly all use cases are sequential, and for those the use of seek vs.
next() is fine because the underlying BufferedIndexInput has an efficient seek function for
sequential access.
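
As a rough illustration of the difference, here is a minimal sketch, not the actual TermScorer code. The `Postings` interface, `ArrayPostings`, `BufferedWalker`, and `DirectWalker` are hypothetical names invented for this example; only the 32-entry buffer size comes from the issue description. On a purely sequential pass both walkers visit the same postings, but the buffered one allocates two extra arrays per term.

```java
/** Hypothetical stand-in for a postings iterator; NOT Lucene's TermDocs API. */
interface Postings {
    boolean next(); // advance to the next posting; false when exhausted
    int doc();
    int freq();
}

/** In-memory postings used only to drive the sketch. */
final class ArrayPostings implements Postings {
    private final int[] docs;
    private final int[] freqs;
    private int pos = -1;

    ArrayPostings(int[] docs, int[] freqs) {
        this.docs = docs;
        this.freqs = freqs;
    }

    public boolean next() { return ++pos < docs.length; }
    public int doc() { return docs[pos]; }
    public int freq() { return freqs[pos]; }
}

/** Look-ahead style: copies up to 32 postings into per-term arrays. */
final class BufferedWalker {
    private static final int BUFFER = 32; // size taken from the issue text
    private final Postings in;
    private final int[] docs = new int[BUFFER];  // extra garbage per term
    private final int[] freqs = new int[BUFFER]; // extra garbage per term
    private int filled = 0;
    private int ptr = -1;

    BufferedWalker(Postings in) { this.in = in; }

    boolean next() {
        ptr++;
        if (ptr >= filled) {
            // Buffer exhausted: refill with up to BUFFER postings.
            filled = 0;
            while (filled < BUFFER && in.next()) {
                docs[filled] = in.doc();
                freqs[filled] = in.freq();
                filled++;
            }
            if (filled == 0) {
                return false;
            }
            ptr = 0;
        }
        return true;
    }

    int doc() { return docs[ptr]; }
    int freq() { return freqs[ptr]; }
}

/** Direct style: no per-term arrays, just delegates to the iterator. */
final class DirectWalker {
    private final Postings in;

    DirectWalker(Postings in) { this.in = in; }

    boolean next() { return in.next(); }
    int doc() { return in.doc(); }
    int freq() { return in.freq(); }
}

final class LookaheadDemo {
    static long sumFreqsBuffered(Postings p) {
        BufferedWalker w = new BufferedWalker(p);
        long sum = 0;
        while (w.next()) {
            sum += w.freq();
        }
        return sum;
    }

    static long sumFreqsDirect(Postings p) {
        DirectWalker w = new DirectWalker(p);
        long sum = 0;
        while (w.next()) {
            sum += w.freq();
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] docs = {1, 4, 9, 16};
        int[] freqs = {2, 1, 3, 1};
        System.out.println(sumFreqsBuffered(new ArrayPostings(docs, freqs)));
        System.out.println(sumFreqsDirect(new ArrayPostings(docs, freqs)));
    }
}
```

Both methods return the same total for a sequential walk; the only difference in this sketch is the two 32-slot arrays that `BufferedWalker` allocates for every term.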

Here's an HPROF run from a set of sequential wildcard searches (with many terms per search).
Since this never performs non-sequential access on documents, the "cache" is completely unnecessary.

          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
   29  0.79% 58.64%   1029312 7148   1801296 12509 387945 float[]
   30  0.79% 59.43%   1029312 7148   1801296 12509 387944 int[]
   31  0.79% 60.23%   1029312 7148   1801296 12509 387943 int[]

TRACE 387943:<init>($TermWeight.scorer($BooleanWeight.scorer(
TRACE 387944:<init>($TermWeight.scorer($BooleanWeight.scorer(
TRACE 387945:<init>($TermWeight.scorer($BooleanWeight.scorer(
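
The per-array size can be read straight off the HPROF rows above. The short sketch below only does the division; the class and method names are made up for illustration. Each array comes out at 144 bytes, of which 128 bytes would be 32 four-byte slots (the 32-entry caches mentioned in the issue text). Attributing the remaining 16 bytes to array header overhead is an assumption about the VM, not something the dump states.

```java
final class HprofArithmetic {
    static final int CACHE_SLOTS = 32;  // cache size from the issue description
    static final int BYTES_PER_SLOT = 4; // int and float are both 4 bytes in Java

    static long bytesPerObject(long bytes, long objs) {
        return bytes / objs;
    }

    public static void main(String[] args) {
        long live = bytesPerObject(1029312L, 7148L);    // "live" columns
        long alloc = bytesPerObject(1801296L, 12509L);  // "alloc'ed" columns
        long payload = (long) CACHE_SLOTS * BYTES_PER_SLOT;

        System.out.println("live bytes per array:     " + live);   // 144
        System.out.println("alloc'ed bytes per array: " + alloc);  // 144
        System.out.println("payload bytes (32 slots): " + payload); // 128
        // Assumption: the 16-byte difference is array header overhead.
        System.out.println("remaining bytes:          " + (live - payload)); // 16
    }
}
```

The three rows have identical object counts, which is consistent with one float[] and two int[] arrays being allocated together at each of the three traced sites, though the truncated traces do not confirm that on their own.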

> TermScorer caches values unnecessarily
> --------------------------------------
>          Key: LUCENE-502
>          URL:
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Versions: 1.9
>     Reporter: Steven Tamm
>  Attachments: TermScorer.patch
> TermScorer aggressively caches the doc and freq of 32 documents at a time for each term
scored.  When querying for a lot of terms, this creates a lot of unnecessary garbage.
The SegmentTermDocs from which it retrieves its information doesn't have any
optimizations for bulk loading, so the cache is unnecessary.
> In addition, it has a SCORE_CACHE that's of limited benefit.  It's caching the result
of a sqrt that should be placed in DefaultSimilarity, and if you're only scoring a few documents
that contain those terms, there's no need to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not cache the docs
or freqs.  When running a lot of queries, that saves 196 bytes/term, the unnecessary disk
IO, and the extra SQRTs, which add up.
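
To make the SCORE_CACHE point concrete, here is a hedged sketch of the trade-off, not the real TermScorer or DefaultSimilarity code. The class, the method names, and the way the term weight is folded in are assumptions for illustration; the 32-entry size mirrors the doc/freq cache size from the description and may not match the actual constant. The cache pays for 32 square roots up front per term, which is wasted work when only a handful of documents are scored.

```java
final class ScoreCacheSketch {
    static final int SCORE_CACHE_SIZE = 32; // assumed size, for illustration only

    /** Assumed term-frequency factor: square root of the raw frequency. */
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    /** Eager variant: precompute tf(freq) * weight for small frequencies. */
    static float[] buildCache(float weight) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int i = 0; i < cache.length; i++) {
            cache[i] = tf(i) * weight;
        }
        return cache;
    }

    /** Look up small frequencies in the cache, compute the rest directly. */
    static float scoreCached(float[] cache, int freq, float weight) {
        return freq < cache.length ? cache[freq] : tf(freq) * weight;
    }

    /** Lazy variant: no array, one sqrt per scored document. */
    static float scoreDirect(int freq, float weight) {
        return tf(freq) * weight;
    }

    public static void main(String[] args) {
        float weight = 1.5f;
        float[] cache = buildCache(weight); // 32 sqrt calls before any scoring
        System.out.println(scoreCached(cache, 4, weight) == scoreDirect(4, weight));
    }
}
```

Both variants produce identical scores; the difference is whether the square roots are computed eagerly for every term or only for the documents that are actually scored.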
