Message-ID: <1753716727.1252733937685.JavaMail.jira@brutus>
Date: Fri, 11 Sep 2009 22:38:57 -0700 (PDT)
From: "Mark Miller (JIRA)"
To: java-dev@lucene.apache.org
Reply-To: java-dev@lucene.apache.org
Subject: [jira] Issue Comment Edited: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
In-Reply-To: <1529898827.1252698057805.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

[
https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754470#action_12754470 ]

Mark Miller edited comment on LUCENE-1908 at 9/11/09 10:37 PM:
---------------------------------------------------------------

Looks great!

bq. Document Euclidean norm |V(d)| is excluded by Lucene, for indexing performance considerations (?).

Hmm - I'm not sure if that is right either. Are we not replacing the |V(d)| normalization factor with our document length factor? That's how it appears to me anyway - for |V(d)| you have many options, right?

the cosine normalization - your standard Euclidean length - |V(d)|
or none (e.g. 1)
or pivoted normalized doc length
or SweetSpotSimilarity's formula
or the quick, dirty, easy, not great doc length normalization that Lucene uses by default
or Okapi's formula, or ...

So we are replacing (which everyone generally does), not dropping, right? And I don't think we are replacing for performance reasons (though it is complicated to calculate) - we are replacing because it's not a great normalization factor. Using |V(d)| eliminates info on the length of the original document - but longer documents will still have higher tf's and more distinct terms - so it unnaturally gives them an advantage (most long docs will be repeated pieces or cover multiple topics). So it's not generally a good normalization factor, and we have chosen another? (The one we have chosen isn't great either - long docs are punished too much and short ones preferred too much.)

Again, I'm not an IR guy, but that's what my modest take is.

* edit:

I suppose you could argue that you could do cosine normalization and then a further normalization approach on that, and in that sense we are dropping the cosine normalization because it's too expensive.
But from what I can see, it appears more the case that you try to use a normalization factor that can just replace cosine normalization - like the pivoted normalization, which rotates the cosine normalization curve. I think pivoted is something like 1/root(stuff, i.e. unique terms), so our norm of 1/root(L) is of a similar, simpler vein. So that's why I think we are not really dropping it - we are choosing one of the variety of replacements.
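
To make the |V(d)| point concrete, here is a minimal sketch (not Lucene code) comparing the two normalization factors discussed above: the cosine factor 1/|V(d)| and the length factor 1/root(L). The class name NormSketch, the whitespace tokenizer, and the use of raw term frequencies as vector weights are all assumptions made for illustration. It shows that a document and the same document concatenated with itself get identical cosine-normalized weights, which is one way of seeing that |V(d)| discards the original length, while 1/root(L) depends only on the token count.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and tokenization are hypothetical, not Lucene APIs.
final class NormSketch {

    // Count term frequencies using a naive whitespace tokenizer (assumption).
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine normalization factor 1/|V(d)|, with raw tf as the vector weights (assumption).
    static double cosineNorm(Map<String, Integer> tf) {
        double sumSquares = 0.0;
        for (int f : tf.values()) {
            sumSquares += (double) f * f;
        }
        return 1.0 / Math.sqrt(sumSquares);
    }

    // Length normalization factor 1/root(L), where L is the total number of tokens.
    static double lengthNorm(Map<String, Integer> tf) {
        int numTokens = 0;
        for (int f : tf.values()) {
            numTokens += f;
        }
        return 1.0 / Math.sqrt(numTokens);
    }

    // Normalized weight of one term: raw tf times the chosen normalization factor.
    static double weight(Map<String, Integer> tf, String term, double norm) {
        return tf.getOrDefault(term, 0) * norm;
    }

    public static void main(String[] args) {
        String text = "lucene scoring lucene norms";
        Map<String, Integer> once = termFreqs(text);
        Map<String, Integer> twice = termFreqs(text + " " + text); // same text repeated

        // Under cosine normalization the repeated document is indistinguishable.
        System.out.println("cosine weight, once:  " + weight(once, "lucene", cosineNorm(once)));
        System.out.println("cosine weight, twice: " + weight(twice, "lucene", cosineNorm(twice)));

        // The length factor shrinks as the token count grows.
        System.out.println("1/root(L), once:  " + lengthNorm(once));
        System.out.println("1/root(L), twice: " + lengthNorm(twice));
    }
}
```

The sketch says nothing about which factor ranks better in practice; it only illustrates what information each factor keeps or throws away.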

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch
>
>
> See discussion in the related issue.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org