Date: Sat, 12 Sep 2009 11:09:57 -0700 (PDT)
From: "Mark Miller (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
Message-ID: <1790268718.1252778997463.JavaMail.jira@brutus>
In-Reply-To: <1529898827.1252698057805.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754577#action_12754577 ]

Mark Miller commented on LUCENE-1908:
-------------------------------------

bq. Is it really better? It seems to "punish" the same for length due to distinct terms, and to punish less for length due to duplicate terms. Is this really a desired behavior? My intuition says no, but I am not sure.

It's only desired behavior if you have a corpus that favors it, but most do.

I think you can think of |V(d)| as taking out information about the document length. You start with an m-space vector, with each term given a weight depending on how many times it occurs. In other words, there is information about the length of the document in there, and when you normalize by |V(d)| you take that information out. But the document will still appear more similar the more unique terms it started with and the higher its tf's. So that method favors long docs, which will naturally have more of both, though of course they will not always be more similar.

All of the normalizations I have seen are in the vein of |V(d)|, e.g. 1/root(something). All of the others also try to make up for this doc length problem by messing with the curve so that ultra-long docs are not favored too highly.

Our default method is known to be not very good, but it's also very fast and efficient compared to the better ones. Sweetspot is much better and, I think, still efficient, but I believe you need to tune the right params.
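
To make the comparison in the quoted question concrete, here is a rough sketch that puts a 1/|V(d)| norm next to a 1/sqrt(numTerms) norm. The second is, as far as I know, the shape of the default length norm. The class name NormSketch, both method names, and especially the per-term weight 1 + ln(tf) are assumptions for illustration only; the weighting actually used in the javadocs under discussion may differ, and idf is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: compares two document normalization factors. */
public class NormSketch {

    /** 1/sqrt(numTerms): depends only on the total number of tokens. */
    public static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    /**
     * 1/|V(d)|, where V(d) holds one weight per distinct term.
     * ASSUMPTION: the weight is 1 + ln(tf), a common sublinear choice.
     */
    public static double cosineNorm(Map<String, Integer> termFreqs) {
        double sumSquares = 0.0;
        for (int tf : termFreqs.values()) {
            double weight = 1.0 + Math.log(tf);
            sumSquares += weight * weight;
        }
        return 1.0 / Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        // Document A: 100 tokens, all distinct.
        Map<String, Integer> distinct = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            distinct.put("t" + i, 1);
        }

        // Document B: 100 tokens, all the same term.
        Map<String, Integer> repeated = new HashMap<>();
        repeated.put("t0", 100);

        // Both documents have 100 tokens, so lengthNorm cannot tell them apart.
        System.out.println("lengthNorm (A and B):  " + lengthNorm(100));
        System.out.println("cosineNorm, distinct:  " + cosineNorm(distinct));
        System.out.println("cosineNorm, repeated:  " + cosineNorm(repeated));
    }
}
```

Under that assumed weighting, the all-distinct document gets the same factor from both norms, while the all-duplicates document gets a larger factor (a smaller penalty) from 1/|V(d)| than from 1/sqrt(numTerms). That is the "punish the same for distinct terms, punish less for duplicate terms" pattern the question describes. With a different tf weighting the gap changes or disappears.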

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch
>
>
> See discussion in the related issue.