Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Reply-To: dev@lucene.apache.org
Date: Fri, 9 Mar 2012 21:32:58 +0000 (UTC)
From: "Doron Cohen (Commented) (JIRA)"
To: dev@lucene.apache.org
Message-ID: <1912103463.45676.1331328778466.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <1961499182.12192.1330035830036.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226494#comment-13226494 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

{quote}
Not understanding really how SloppyPhraseScorer works now, but not trying to add confusion to the issue, I can't help but think this problem is a variant on LevenshteinAutomata... in fact that was the motivation for the new test, I just stole the testing methodology from there and applied it to this!
{quote}

Interesting! I was not aware of this. I went and read a bit about this automaton; it is relevant.

{quote}
It seems many things are the same but with a few twists:
* fundamentally we are interleaving the streams from the subscorers into the 'index automaton'
* 'query automaton' is produced from the user-supplied terms
{quote}

True. In fact, the current code works hard to decide on the "correct interleaving order", whereas if we had a "Perfect Levenshtein Automaton" that took care of the computation, we could just interleave in term position order (forget about phrase position and all that) and let the automaton compute the distance.

This might capture the difficulty in making the sloppy phrase scorer correct: it started with an algorithm that was optimized for exact matching and adapted (hacked?) it for approximate matching. Starting instead with a model that fits approximate matching might be easier and cleaner. I like that.

{quote}
* our 'alphabet' is the terms, and holes from position increment are just an additional symbol.
* just like the LevenshteinAutomata case, repeats are problematic because they are different characteristic vectors
* stacked terms at the same position (index or query) just make the automata more complex (so they aren't just strings)

I'm not suggesting we try to re-use any of that code at all, I don't think it will work. But I wonder if we can re-use even some of the math to redefine the problem more formally to figure out what minimal state/lookahead we need for example...
{quote}

I agree. I'll think about this. In the meantime I'll commit this partial fix.

> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x
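
For reference, the property discussed in this issue (a document found with no slop must still be found once slop is added) can be written down as a small brute-force model. The sketch below is only an illustration under stated assumptions, not Lucene's SloppyPhraseScorer: the class and method names are hypothetical, a document is modeled as a plain array of terms indexed by position, and a sloppy match is assumed to mean that every query term can be assigned a distinct document position such that the shifted positions (document position minus query position) differ by at most the slop.

```java
/**
 * Hypothetical brute-force reference model for phrase matching on a single
 * document. Illustrative only; this is not Lucene code.
 */
class SloppyPhraseModel {

    /** True if the query terms occur at consecutive positions somewhere in doc. */
    static boolean exactMatch(String[] doc, String[] query) {
        for (int start = 0; start + query.length <= doc.length; start++) {
            boolean ok = true;
            for (int i = 0; i < query.length && ok; i++) {
                ok = query[i].equals(doc[start + i]);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /**
     * True if each query term can be given its own doc position so that the
     * shifted positions (docPos - queryPos) span a range of at most slop.
     * Assumed definition of a sloppy match, see the lead-in above.
     */
    static boolean sloppyMatch(String[] doc, String[] query, int slop) {
        return search(doc, query, slop, 0, new boolean[doc.length],
                Integer.MAX_VALUE, Integer.MIN_VALUE);
    }

    /** Tries every assignment of doc positions to query terms i, i+1, ... */
    private static boolean search(String[] doc, String[] query, int slop, int i,
                                  boolean[] used, int min, int max) {
        if (i == query.length) {
            return true; // every query term has a position within the allowed range
        }
        for (int pos = 0; pos < doc.length; pos++) {
            // Repeated query terms must not share one doc position.
            if (used[pos] || !query[i].equals(doc[pos])) {
                continue;
            }
            int shifted = pos - i;
            int newMin = Math.min(min, shifted);
            int newMax = Math.max(max, shifted);
            if (newMax - newMin > slop) {
                continue; // this choice already exceeds the slop
            }
            used[pos] = true;
            boolean found = search(doc, query, slop, i + 1, used, newMin, newMax);
            used[pos] = false;
            if (found) {
                return true;
            }
        }
        return false;
    }
}
```

In this model an exact match assigns every term the same shifted position, so anything accepted by exactMatch is also accepted by sloppyMatch for any slop of 0 or more. The search is exponential in the worst case, so it is only suitable as a checking aid on tiny inputs, not as a scorer.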