Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <1378698076.1226547824649.JavaMail.jira@brutus>
Date: Wed, 12 Nov 2008 19:43:44 -0800 (PST)
From: "Mark Miller (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-502) TermScorer caches values unnecessarily
In-Reply-To: <1037836979.1141180479307.JavaMail.jira@ajax.apache.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/LUCENE-502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-502:
-------------------------------

    Attachment: LUCENE-503.patch

Are we interested in this optimization?

Here is an attempted patch. 

Two issues:

1. It might be better to use IDF rather than doc freq to decide which scorer to use (TermScorer or LowFreqTermScorer), so that doc freq doesn't need to be accessed twice.

2. I don't know at what cutoff 'level' the LowFreqTermScorer should give way to the TermScorer. Some benchmarking may help.
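
To make the two issues above concrete, here is a minimal standalone sketch of the selection logic, not the code in the attached patch. The class ScorerChoice, the cutoff value 32, and the string return values are placeholders made up for illustration; the idf formula is meant to mirror DefaultSimilarity.idf(docFreq, numDocs) as I understand it, so treat that as an assumption too. Because IDF falls as doc freq rises, a doc freq cutoff can be turned into an equivalent IDF cutoff, which is the idea behind issue 1.

```java
public class ScorerChoice {
    // Hypothetical cutoff: the right value is exactly what issue 2 says needs benchmarking.
    static final int LOW_FREQ_CUTOFF = 32;

    // Assumed to mirror DefaultSimilarity.idf(docFreq, numDocs): log(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Selection keyed directly on doc freq (requires looking up doc freq a second time).
    static String chooseByDocFreq(int docFreq) {
        return docFreq <= LOW_FREQ_CUTOFF ? "LowFreqTermScorer" : "TermScorer";
    }

    // Selection keyed on an IDF value that has already been computed.
    // A low doc freq corresponds to a high IDF, so the comparison is inverted.
    static String chooseByIdf(float idf, int numDocs) {
        float cutoffIdf = idf(LOW_FREQ_CUTOFF, numDocs);
        return idf >= cutoffIdf ? "LowFreqTermScorer" : "TermScorer";
    }
}
```

Note that the IDF cutoff depends on numDocs, so it would have to be derived per index rather than hard-coded.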

> TermScorer caches values unnecessarily
> --------------------------------------
>
>                 Key: LUCENE-502
>                 URL: https://issues.apache.org/jira/browse/LUCENE-502
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 1.9
>            Reporter: Steven Tamm
>            Priority: Minor
>         Attachments: LUCENE-503.patch, TermScorer.patch
>
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for each term scored.  When querying for a lot of terms, this creates a lot of unnecessary garbage.  The SegmentTermDocs from which it retrieves its information doesn't have any optimizations for bulk loading, so the caching is unnecessary.
> In addition, it has a SCORE_CACHE that's of limited benefit.  It's caching the result of a sqrt that should be placed in DefaultSimilarity, and if you're only scoring a few documents that contain those terms, there's no need to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not cache the docs or freqs.  In the case of a lot of queries, that saves 196 bytes/term, the unnecessary disk IO, and extra SQRTs, which add up.
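
The SCORE_CACHE trade-off described in the quoted issue can be illustrated with a simplified standalone sketch. This is not the actual TermScorer source: the class TfScoreSketch and its method names are made up for illustration, norms are left out, and tf is assumed to be the square root of the term frequency as in DefaultSimilarity. It shows that the cached and direct paths produce the same value, and that building the cache costs 32 sqrt calls per term up front regardless of how many documents are actually scored.

```java
public class TfScoreSketch {
    // Cache size of 32 entries, taken from the issue description.
    static final int SCORE_CACHE_SIZE = 32;

    // Assumed tf function: square root of the term frequency.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Cached variant: precompute tf(i) * weight for every small freq, once per term.
    static float[] buildScoreCache(float weightValue) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int i = 0; i < SCORE_CACHE_SIZE; i++) {
            cache[i] = tf(i) * weightValue;
        }
        return cache;
    }

    // Look up small freqs in the cache, fall back to computing for large ones.
    static float cachedScore(float[] cache, int freq, float weightValue) {
        return freq < SCORE_CACHE_SIZE ? cache[freq] : tf(freq) * weightValue;
    }

    // Uncached variant: one sqrt per scored document and no per-term array.
    static float directScore(int freq, float weightValue) {
        return tf(freq) * weightValue;
    }
}
```

For a term that matches only a handful of documents, the direct variant performs fewer sqrt calls than filling the cache does, which is the case the issue is concerned with.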
