Date: Tue, 6 Mar 2012 21:55:57 +0000 (UTC)
From: "Doron Cohen (Commented) (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Message-ID: <1837501910.29811.1331070957854.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1961499182.12192.1330035830036.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I'm afraid it won't solve the problem.

The complexity of SloppyPhraseScorer stems first of all from the slop. That part has been handled in the scorer for a long time.

Two additional complications are repeating terms and multi-term phrases. Each of these, separately, is handled as well. Their combination, however, is the reason for this discussion.

To prevent two repeating terms from landing on the same document position, we propagate the smaller of them (smaller in its phrase-position, which takes into account both the doc-position and the offset of that term in the query). Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because both "a"s (query terms) would land on the same "a" in the document. This is illegal and is prevented by such propagation.

But when one of the repeating terms is a multi-term, it is not possible to know which of the repeating terms to propagate. This is the unsolved bug.

Now, back to the current ExactPhraseScorer. It does not have this problem with repeating terms. That is not because of the different algorithm, but because of the different scenario: exact phrase scoring simply does not have this problem. In exact phrase scoring, a match is declared only when all PPs (PhrasePositions) are in the same phrase position. Recall that phrase position = doc-position - query-offset. It follows that when two PPs with different query offsets are in the same phrase-position, their doc-positions cannot be the same, and therefore no special handling is needed for repeating terms in exact phrase scorers.
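
The repeating-term collision and the phrase-position argument described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not Lucene's scorer: it brute-forces every assignment of query terms to document positions, uses max minus min of the phrase positions as a simplified match width, and all function names (phrase_position, term_positions, matches) are hypothetical.

```python
from itertools import product


def phrase_position(doc_position, query_offset):
    # Definition from the post: phrase position = doc-position - query-offset.
    return doc_position - query_offset


def term_positions(doc_tokens):
    # Map each token to the list of positions where it occurs in the document.
    positions = {}
    for i, token in enumerate(doc_tokens):
        positions.setdefault(token, []).append(i)
    return positions


def matches(query_terms, doc_tokens, slop, forbid_shared_positions):
    # Toy matcher (an assumption, not Lucene's algorithm): try every way of
    # assigning one document position to each query term and accept if the
    # spread of the resulting phrase positions is within the slop.
    if not query_terms:
        return False
    positions = term_positions(doc_tokens)
    per_term = [positions.get(term, []) for term in query_terms]
    for alignment in product(*per_term):
        # Optionally reject alignments where two query terms share a doc position.
        if forbid_shared_positions and len(set(alignment)) != len(alignment):
            continue
        pps = [phrase_position(d, off) for off, d in enumerate(alignment)]
        if max(pps) - min(pps) <= slop:
            return True
    return False


query = ["a", "b", "a"]

# Sloppy, no repeat handling: both query "a"s land on the single doc "a".
naive_sloppy = matches(query, ["a", "b"], slop=2, forbid_shared_positions=False)

# Sloppy, with shared positions forbidden: the illegal match disappears.
guarded_sloppy = matches(query, ["a", "b"], slop=2, forbid_shared_positions=True)

# Exact (slop 0): equal phrase positions force distinct doc positions,
# so no guard is needed to reject "a b".
exact_short_doc = matches(query, ["a", "b"], slop=0, forbid_shared_positions=False)
exact_full_doc = matches(query, ["a", "b", "a"], slop=0, forbid_shared_positions=False)
```

In this model the only alignment of "a b a" against "a b" has doc positions (0, 1, 0) and phrase positions (0, 0, -2), a spread of 2, which is why slop 2 admits it unless shared positions are forbidden, while slop 0 rejects it without any special handling.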

However, once we add that sloppy-decaying frequency, we will match, at a certain posIndex, different phrase-positions. This is because of the slop. So they might land on the same doc-position, and then we start all over again... This is really too bad.

Sorry for the lengthy post; hopefully it will help when someone wants to get into this.

Back to option 2.

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> Jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x