Message-ID: <1855049494.1252858677545.JavaMail.jira@brutus>
Date: Sun, 13 Sep 2009 09:17:57 -0700 (PDT)
From: "Doron Cohen (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect
In-Reply-To: <1529898827.1252698057805.JavaMail.jira@brutus>

[ 
https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754716#action_12754716 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

{quote}
The intro to ir book appears to break it down so that you can explain it with the math (why going into the unit vector space favors longer docs) - but other work I am seeing says the math tells you no such thing, and it's just comparing it to the computed relevancy curve that tells you it's not great.
{quote}

To my (current) understanding it goes like this: normalizing all V(d)'s to the unit vector loses all information about document length. For a large document made by duplicating a smaller one this is probably ok. For a large document that just contains lots of "unique" text this is probably wrong. To solve this, a different normalization is sometimes preferred, one that does not normalize V(d) to the unit vector. (This is very much in line with what you (Mark) wrote above; finally I understand this...)

Pivoted length normalization, which you mentioned, is one such alternative. Juru in fact uses this document length normalization. In our TREC experiments with Lucene we tried this approach (we modified Lucene indexing so that all required components were indexed as stored/cached fields, and at search time we could try various scoring algorithms). Interestingly, pivoted length normalization did not work well - in our experiments - with Lucene for TREC.

The document length normalization of Lucene's DefaultSimilarity (DS) now seems to me - intuitively - not so good, at least for the previously mentioned two edge cases, where doc1 is made of N distinct terms, and doc2 is made of the same N distinct terms, but its length is 2N because each term appears twice. For doc1, DS will normalize to the unit vector, same as EN, and for doc2 DS will normalize to a vector larger than the unit vector.
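
The two edge cases can be checked numerically with a minimal sketch. It assumes raw term counts as vector weights (no idf, no boosts, no tf damping), reads "EN" as Euclidean normalization, and models the DS length norm as 1/sqrt(number of terms in the field). The function names and the choice N = 100 are illustrative only and are not part of Lucene.

```python
import math

def euclidean_length(tfs):
    # |V(d)| when each term's weight is its raw in-document count (assumption).
    return math.sqrt(sum(tf * tf for tf in tfs))

def euclidean_norm_factor(tfs):
    # Factor that scales V(d) to the unit vector.
    return 1.0 / euclidean_length(tfs)

def ds_norm_factor(tfs):
    # Length norm modeled as 1/sqrt(total term occurrences in the document).
    return 1.0 / math.sqrt(sum(tfs))

def normalized_length(tfs, factor):
    # Length of V(d) after multiplying by the given normalization factor.
    return euclidean_length(tfs) * factor

N = 100                 # illustrative value
doc1 = [1] * N          # N distinct terms, each appearing once (length N)
doc2 = [2] * N          # same N terms, each appearing twice (length 2N)

en1 = normalized_length(doc1, euclidean_norm_factor(doc1))
en2 = normalized_length(doc2, euclidean_norm_factor(doc2))
ds1 = normalized_length(doc1, ds_norm_factor(doc1))
ds2 = normalized_length(doc2, ds_norm_factor(doc2))

print("Euclidean:", en1, en2)   # both documents become unit vectors
print("DS-style: ", ds1, ds2)   # doc1 is a unit vector, doc2 is longer
```

Under these assumptions doc1 comes out as a unit vector under both schemes, while doc2 keeps a length of sqrt(2) under the DS-style norm. The result depends on using raw counts as weights; a damped tf would change the numbers.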
However, I think the desired behavior is the other way around: for doc2 to be the same as EN, and for doc1 to be normalized to a vector larger than the unit vector.

Back to the documentation patch: again, it seems wrong to present it as if both EU and some additional doc length normalization are required - fixed patch to follow...

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.