Message-ID: <4BA6CCD2.9060501@peknet.com>
Date: Sun, 21 Mar 2010 20:50:10 -0500
From: Peter Karman
To: lucy-dev@lucene.apache.org
CC: "KinoSearch discussion list."
Subject: Re: ProximityQuery
In-Reply-To: <20100321200704.GA27563@rectangular.com>

Marvin Humphrey wrote on 3/21/10 3:07 PM:
> On Sun, Mar 21, 2010 at 02:01:41AM -0500, Peter Karman wrote:
>> Marvin, please have a look when you have a chance, and let me know what needs
>> changing.
>
> The current implementation has a limitation I think is probably pretty
> important: 'b NEAR a' doesn't return the same result set as 'a NEAR b'.
>

As you noted earlier in this thread, there is no consensus about what a
proximity query is. :)

I did consider the fact that proximity might imply that order does not matter.
But I came down here: if I want order to matter, and the ProximityScorer ignores
order as you're suggesting, then I have no options. I can't limit my search to
'a NEAR b'. If instead we leave the ProximityScorer as is, then this:

  (a NEAR b) OR (b NEAR a)

does what you're describing.

Consider too:

  (a NEAR b NEAR c)

which might be written as:

  "a b c"~10

What order should I consider there? 'a' within 10 positions of 'b' and 'c'? or
'b' within 10 positions of 'a' and 'c'? or...

You see how the possibilities multiply. I think simpler is better here: if you
want order to not matter, then OR together the various orders you might be
interested in. In fact, I may offer that as an option in the
Search::Query::Parser, which could then do the ORing programmatically.
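To make the "OR together the various orders" idea concrete, here is a minimal Python sketch. It is not the KinoSearch ProximityScorer or the Search::Query::Parser code; the function names `ordered_near` and `any_order_near` are hypothetical, and the matching rule (each term must follow the previous one by at most `within` positions, so `within == 1` behaves like a phrase) is an assumption made for illustration.

```python
from itertools import permutations


def ordered_near(tokens, terms, within):
    """Return True if `terms` occur in the given order in `tokens`.

    Assumed rule (illustrative only): each term must appear after the
    previous term, at most `within` positions later.
    """
    if not terms:
        return False
    # Positions where the first term occurs; these seed the partial matches.
    current = [i for i, tok in enumerate(tokens) if tok == terms[0]]
    for term in terms[1:]:
        positions = [i for i, tok in enumerate(tokens) if tok == term]
        # Keep only positions that extend some partial match in order.
        current = [p for p in positions
                   if any(0 < p - q <= within for q in current)]
        if not current:
            return False
    return bool(current)


def any_order_near(tokens, terms, within):
    """OR together the order-sensitive match over every ordering of `terms`."""
    return any(ordered_near(tokens, list(order), within)
               for order in permutations(terms))


if __name__ == "__main__":
    doc = "b x a".split()
    print(ordered_near(doc, ["a", "b"], 3))    # order-sensitive: a then b
    print(ordered_near(doc, ["b", "a"], 3))    # order-sensitive: b then a
    print(any_order_near(doc, ["a", "b"], 3))  # (a NEAR b) OR (b NEAR a)
```

With two terms the wrapper expands to two clauses; with three it expands to six, which is the sense in which the possibilities multiply. Keeping the scorer order-sensitive leaves that expansion as an opt-in for the caller.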
Likewise, if we choose to support the "a b"~N syntax in the KS QueryParser, we
could do something similar.

I note that one of the Lucene classes you mentioned earlier[0] makes inOrder an
option. The Lucene PhraseScorer's slop feature, however, does seem to respect
order with no option otherwise.

[0] http://svn.apache.org/repos/asf/lucene/java/trunk/src/java/org/apache/lucene/search/spans/SpanNearQuery.java

>
> Superficial stylistic suggestion: I might propose "proximity" (or "nearness",
> but "proximity" is better) instead of "near" for the name of that parameter.
> Or alternately, "slop", but I understand why you went with nearness instead.

I like 'proximity' for consistency's sake. And yes, 'near' is not quite right.
How about 'within'? Or 'vicinity'?

>
>> In the end it was a one-line difference in the SI_winnow_anchors implementation
>> to get the near/slop feature working. I left the original implementation intact
>> and put a switch/case wrapper around it to leave the optimization (if any)
>> intact for phrases (near==1).
>
> This doesn't technically need changing, but to cut down on the duplicated
> code, the switch on self->near should theoretically happen here:

ah yes, that's much better.

-- 
Peter Karman . http://peknet.com/ . peter@peknet.com