From: "Doron Cohen (JIRA)"
To: java-dev@lucene.apache.org
Date: Fri, 27 Oct 2006 16:05:18 -0700 (PDT)
Subject: [jira] Updated: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring

     [ http://issues.apache.org/jira/browse/LUCENE-697?page=all ]

Doron Cohen updated LUCENE-697:
-------------------------------

    Attachment: sloppy_phrase_skipTo.patch

This was tricky, for me anyhow, but I think I found it.
The difference in scoring between using next() and using skipTo() (or a combination of the two) is caused by two (valid) orders of the sorted PhrasePositions. Currently PhrasePositions sorting is defined by doc and position, where position already takes into account the offset of the term within the (phrase) query. If, however, two PhrasePositions have the same doc and the same position, the sort makes no decision, so either order is a valid sort (by the current sort definition).

The difference between next() and skipTo() in this regard is that skipTo() always calls sort(), sorting the entire set, while next() calls sort() only at initialization and then maintains the order as part of the scoring process.

This is clearer with the following example, taken from Yonik's test case that is failing now:
 - Doc1: w1 w3 w2 w3 zz
 - Query: "w3 w2"~2

When scoring starts in this doc, both PhrasePositions pp(w3) and pp(w2) have doc(w2)=doc(w3)=1. Note that for the second w3 that matches we would have pos(w2)=2+1=3 and pos(w3)=3+0=3. So, after scoring doc1("w3 w2"), if the sort result places pp(w2) at the top, we would also score a second time for doc1("w3 w2"), using the second w3. However, if the sort places pp(w3) at the top (==smallest), we would not score that second match.

Current behavior is inconsistent: skipTo() would take the two while next() won't, and I think it is possible to create a case where it would be the other way around. So the behavior should definitely be made consistent.

The next question is: do we want to sum (or max) the frequency for both (or more) cases? I think yes, sum.

To fix this I am changing the PhrasePositions comparison so that, when positions are equal, the actual doc position (ignoring the offset in the query phrase) is considered. In addition, I added missing calls to clear the priority queue before starting to sort, and to mark that no more initialization is required when skipTo() is called.
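The tie described above can be illustrated with a minimal, self-contained sketch. It is not the attached patch and does not use Lucene's classes: the `Pp` type, its field names, and both comparators are hypothetical stand-ins, and the position values are copied from the example as stated in this message. A stable list sort is used here only to model "ties keep whatever order they arrived in"; the real code uses a priority queue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PhrasePositionOrderSketch {

    /** Hypothetical stand-in for one phrase position; NOT Lucene's PhrasePositions. */
    static final class Pp {
        final String term;
        final int doc;
        final int position;     // offset-adjusted position (values taken from the message)
        final int rawPosition;  // actual position of the term in the document

        Pp(String term, int doc, int position, int rawPosition) {
            this.term = term;
            this.doc = doc;
            this.position = position;
            this.rawPosition = rawPosition;
        }
    }

    /** Order by doc, then adjusted position; equal entries are left undecided. */
    static final Comparator<Pp> DOC_THEN_POSITION =
            Comparator.comparingInt((Pp p) -> p.doc).thenComparingInt(p -> p.position);

    /** Same order, but ties fall back to the actual document position. */
    static final Comparator<Pp> WITH_TIE_BREAK =
            DOC_THEN_POSITION.thenComparingInt(p -> p.rawPosition);

    /** Sorts a copy of the input and returns the term that ends up on top (smallest). */
    static String firstTerm(List<Pp> input, Comparator<Pp> order) {
        List<Pp> copy = new ArrayList<>(input);
        copy.sort(order); // stable sort: tied elements keep their input order
        return copy.get(0).term;
    }

    public static void main(String[] args) {
        // Doc1: w1 w3 w2 w3 zz; both entries tie on doc=1 and adjusted position=3.
        Pp w2 = new Pp("w2", 1, 3, 2);
        Pp w3 = new Pp("w3", 1, 3, 3);

        // Without a tie-break the top element depends on arrival order.
        System.out.println(firstTerm(Arrays.asList(w2, w3), DOC_THEN_POSITION)); // prints "w2"
        System.out.println(firstTerm(Arrays.asList(w3, w2), DOC_THEN_POSITION)); // prints "w3"

        // With the tie-break the top element is the same for both arrival orders.
        System.out.println(firstTerm(Arrays.asList(w2, w3), WITH_TIE_BREAK)); // prints "w2"
        System.out.println(firstTerm(Arrays.asList(w3, w2), WITH_TIE_BREAK)); // prints "w2"
    }
}
```

The point of the sketch is only that a total order removes the dependence on how the entries reached the sort, which is what makes next() and skipTo() agree.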
I tested with the sequence that Yonik added:
 - skip skip next next skip skip

And also with the sequences:
 - skip skip skip skip skip skip
 - next next next next next next
 - skip next skip next skip next
 - next skip next skip next skip
 - next next skip skip next next

The latter 5 cases are now commented out; the first case is in effect.

This scoring code still does not feel natural to me, so (actually as always) comments will be appreciated.

- Doron

> Scorer.skipTo affects sloppyPhrase scoring
> ------------------------------------------
>
>                 Key: LUCENE-697
>                 URL: http://issues.apache.org/jira/browse/LUCENE-697
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.0.0
>            Reporter: Yonik Seeley
>         Assigned To: Doron Cohen
>         Attachments: sloppy_phrase_skipTo.patch
>
>
> If you mix skipTo() and next(), you get different scores than what is returned to a hit collector.