Message-ID: <1378698076.1226547824649.JavaMail.jira@brutus>
Date: Wed, 12 Nov 2008 19:43:44 -0800 (PST)
From: "Mark Miller (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-502) TermScorer caches values unnecessarily
In-Reply-To: <1037836979.1141180479307.JavaMail.jira@ajax.apache.org>

     [ https://issues.apache.org/jira/browse/LUCENE-502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-502:
-------------------------------

    Attachment: LUCENE-503.patch

Are we interested in this optimization? Here is an attempted patch. Two issues:

1. Seems it might be better to try and use IDF to determine which scorer to use (TermScorer or LowFreqTermScorer) rather than doc freq, so that doc freq doesn't need to be accessed twice.

2. I don't know at what 'level' the LowFreqTermScorer should be cut out in favor of the TermScorer. Some benchmarking may help.

> TermScorer caches values unnecessarily
> --------------------------------------
>
>                 Key: LUCENE-502
>                 URL: https://issues.apache.org/jira/browse/LUCENE-502
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 1.9
>            Reporter: Steven Tamm
>            Priority: Minor
>         Attachments: LUCENE-503.patch, TermScorer.patch
>
>
> TermScorer aggressively caches the doc and freq of 32 documents at a time for each term scored. When querying for a lot of terms, this creates a lot of unnecessary garbage. The SegmentTermDocs from which it retrieves its information doesn't have any optimizations for bulk loading, so the caching is unnecessary.
> In addition, it has a SCORE_CACHE that's of limited benefit. It caches the result of a sqrt that should be placed in DefaultSimilarity, and if you're only scoring a few documents that contain those terms, there's no need to precalculate the SQRT, especially on modern VMs.
> Enclosed is a patch that replaces TermScorer with a version that does not cache the docs or freqs. In the case of a lot of queries, that saves 196 bytes/term, the unnecessary disk IO, and extra SQRTs, which add up.
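
For readers following the thread, here is a rough sketch of the trade-off being discussed. It is not the attached patch and not Lucene's code. The class ScorerSketch, its method names, and the cutoff value of 32 are all made up for illustration, and the thread leaves the real cutoff open. The only assumption taken from the issue text is that the cached value is a sqrt of the term frequency multiplied by a weight.

```java
// Illustrative only: every name here is hypothetical, not Lucene API.
final class ScorerSketch {
    // Assumed cache size, mirroring the "32" mentioned in the issue.
    static final int SCORE_CACHE_SIZE = 32;
    // Hypothetical doc freq cutoff; the right value would need benchmarking.
    static final int LOW_FREQ_CUTOFF = 32;

    // Precompute sqrt(freq) * weight for small freqs (the cached approach).
    static float[] buildScoreCache(float weight) {
        float[] cache = new float[SCORE_CACHE_SIZE];
        for (int freq = 0; freq < cache.length; freq++) {
            cache[freq] = (float) Math.sqrt(freq) * weight;
        }
        return cache;
    }

    // Look up the cache when possible, fall back to computing the value.
    static float cachedScore(float[] cache, int freq, float weight) {
        return freq < cache.length ? cache[freq] : (float) Math.sqrt(freq) * weight;
    }

    // Compute the score on demand, with no per-term allocation.
    static float directScore(int freq, float weight) {
        return (float) Math.sqrt(freq) * weight;
    }

    // Pick the no-cache scorer for terms that occur in few documents.
    static boolean useLowFreqScorer(int docFreq) {
        return docFreq < LOW_FREQ_CUTOFF;
    }

    public static void main(String[] args) {
        float weight = 2.0f;
        float[] cache = buildScoreCache(weight);
        for (int docFreq : new int[] {3, 5000}) {
            String scorer = useLowFreqScorer(docFreq) ? "LowFreqTermScorer" : "TermScorer";
            System.out.println("docFreq=" + docFreq + " -> " + scorer);
        }
        // Both paths produce the same value; only the allocation differs.
        System.out.println(cachedScore(cache, 4, weight) == directScore(4, weight));
    }
}
```

The cached path allocates one float array per term scored, which is the kind of garbage the issue complains about when a query has many terms. The direct path trades that allocation for one sqrt per scored document, which is cheap when a term matches only a few documents.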