Date: Fri, 5 Aug 2011 11:06:05 +1000
Subject: Rewriting other query types into span queries and two questions about this
From: Trejkaz
To: Lucene Users Mailing List

Hi all.

I am writing a custom query parser which strongly resembles StandardQueryParser (I use a lot of the same processors and builders, with a slightly customised config handler and a completely new syntax parser written as an ANTLR grammar.) My parser has additional syntax for span queries.

The SyntaxParser is pretty much done and now I'm up to the stage where I have to process this into a valid Query object. Of course, span queries cannot accept any other kind of query inside them (at least not yet - I realise work is already being done to unify the two kinds of query), so any query the user might put inside there needs to be transformed into an equivalent span query.

For some of these, this is straight-forward:

    TermQuery -> convert to SpanTermQuery
    WildcardQuery, PrefixQuery, FuzzyQuery, RegexQuery -> wrap in SpanMultiTermQueryWrapper

For PhraseQuery and MultiPhraseQuery, as long as the slop is 0, it seems like you can rewrite as follows:

    phrase-query(
        term-query('this'),
        term-query('is'),
        term-query('my'),
        term-query('cat')
    )
    ->
    span-near-query({slop=0, forwards-only=true}
        span-term-query('this'),
        span-term-query('is'),
        span-term-query('my'),
        span-term-query('cat')
    )

(For MultiPhraseQuery the inner queries would be rewritten to SpanMultiTermQueryWrapper but aside from that, it's the same.)

When the slop is non-zero, I'm not sure what to do. Does it still translate directly? I suspect not, because PhraseQuery slop is asymmetrical (centred around the term *after* the previous match) whereas SpanNearQuery slop is symmetrical (centred around the previous match, although the term to either side is numbered 0 instead of 1 as one might expect.)

Q1: Is there some way to (precisely) simulate phrase query behaviour in spans?

For boolean queries, it depends... If it's a pure OR query, you can rewrite like this:

    within(2, 'my', or('cat', 'dog'))
    ->
    or(
        within(2, 'my', 'cat'),
        within(2, 'my', 'dog')
    )

This doesn't appear to change the semantics of the query. I notice there is a SpanOrQuery as well, which I could probably use instead... but it doesn't seem to make a difference.

For AND (and for any "default boolean" queries which aren't equivalent to OR) queries, I have problems. For instance, you can't do this:

    within(5, 'my', and('cat', 'dog'))
    ->
    and(
        within(5, 'my', 'cat'),
        within(5, 'my', 'dog')
    )

The problem is that this changes the semantics - the original query implies that the same "my" span is used when matching the other two, whereas the rewritten form allows it to be any "my" in the document. This problem doesn't exist with OR queries because it doesn't have to match both terms.

Q2: Is there some way to "pin this down" such that the "my" matched by each is the same position?

TX
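
To make the AND problem above concrete, here is a minimal sketch in plain Java that models a document as a list of tokens and compares the two readings. It does not use any Lucene classes. The names (PinnedSpanDemo, near, rewritten, pinned) are hypothetical, and "within n" is simplified to an absolute position difference of at most n, which is an assumption and not how Lucene counts slop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PinnedSpanDemo {

    /** Positions at which the given term occurs in the token list. */
    public static List<Integer> positions(List<String> doc, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < doc.size(); i++) {
            if (doc.get(i).equals(term)) {
                out.add(i);
            }
        }
        return out;
    }

    /** Simplified "within": some position in bs is at most n away from a (assumption, not Lucene slop). */
    public static boolean near(int a, List<Integer> bs, int n) {
        for (int b : bs) {
            if (Math.abs(a - b) <= n) {
                return true;
            }
        }
        return false;
    }

    /** and(within(n, anchor, t1), within(n, anchor, t2)): each clause may pick a different anchor. */
    public static boolean rewritten(List<String> doc, int n, String anchor, String t1, String t2) {
        boolean first = false;
        boolean second = false;
        for (int a : positions(doc, anchor)) {
            first |= near(a, positions(doc, t1), n);
            second |= near(a, positions(doc, t2), n);
        }
        return first && second;
    }

    /** within(n, anchor, and(t1, t2)): a single anchor occurrence must be near both terms. */
    public static boolean pinned(List<String> doc, int n, String anchor, String t1, String t2) {
        for (int a : positions(doc, anchor)) {
            if (near(a, positions(doc, t1), n) && near(a, positions(doc, t2), n)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two separate "my" occurrences: one next to "cat", the other next to "dog".
        List<String> doc = Arrays.asList(
                "my", "cat", "x", "x", "x", "x", "x", "x", "x", "x", "my", "dog");
        System.out.println("rewritten form matches: " + rewritten(doc, 5, "my", "cat", "dog"));
        System.out.println("pinned form matches:    " + pinned(doc, 5, "my", "cat", "dog"));
    }
}
```

On this document the rewritten form matches while the pinned form does not, because no single "my" is within 5 positions of both "cat" and "dog". That gap is what Q2 is asking how to close.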