lucene-java-user mailing list archives

From Trejkaz
Subject Rewriting other query types into span queries and two questions about this
Date Fri, 05 Aug 2011 01:06:05 GMT
Hi all.

I am writing a custom query parser which strongly resembles
StandardQueryParser (I use a lot of the same processors and builders,
with a slightly customised config handler and a completely new syntax
parser written as an ANTLR grammar.)  My parser has additional syntax
for span queries.  The SyntaxParser is pretty much done and now I'm up
to the stage where I have to process this into a valid Query object.

Of course, span queries cannot accept any other kind of query inside
them (at least not yet - I realise work is already being done to unify
the two kinds of query), so any query the user might put inside there
needs to be transformed into an equivalent span query.  For some of
these, this is straightforward:

    TermQuery -> convert to SpanTermQuery
    WildcardQuery, PrefixQuery, FuzzyQuery, RegexQuery -> wrap in

For PhraseQuery and MultiPhraseQuery, as long as the slop is 0, it
seems like you can rewrite as follows:

    phrase-query( term-query('this'), term-query('is'),
                  term-query('my'), term-query('cat') )
    -> span-near-query( {slop=0, forwards-only=true},
                        span-term-query('this'), span-term-query('is'),
                        span-term-query('my'), span-term-query('cat') )

(For MultiPhraseQuery the inner queries would be rewritten to
SpanMultiTermQueryWrapper but aside from that, it's the same.)
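To illustrate why the slop-0 case lines up, here is a minimal sketch in plain Java that works on token positions only.  It does not use the Lucene API: exactPhrase and orderedNear are hypothetical helpers, and orderedNear assumes a simplified model of an in-order span-near in which the slop is the total number of positions skipped between consecutive terms.  Both helpers assume a non-empty term list.

```java
public class PhraseAsSpanDemo {

    /** Exact phrase: terms[i] must sit at position start + i for some start. */
    static boolean exactPhrase(String[] doc, String[] terms) {
        for (int start = 0; start + terms.length <= doc.length; start++) {
            boolean ok = true;
            for (int i = 0; i < terms.length && ok; i++) {
                ok = doc[start + i].equals(terms[i]);
            }
            if (ok) return true;
        }
        return false;
    }

    /** Simplified in-order near: terms appear in order, skipping at most `slop` positions in total. */
    static boolean orderedNear(String[] doc, String[] terms, int slop) {
        for (int s = 0; s < doc.length; s++) {
            if (doc[s].equals(terms[0]) && extend(doc, terms, 1, s, slop)) return true;
        }
        return false;
    }

    /** Tries to place terms[next..] after position `prev` with `slopLeft` skipped positions remaining. */
    private static boolean extend(String[] doc, String[] terms, int next, int prev, int slopLeft) {
        if (next == terms.length) return true;
        for (int p = prev + 1; p < doc.length && p - prev - 1 <= slopLeft; p++) {
            int gap = p - prev - 1;
            if (doc[p].equals(terms[next]) && extend(doc, terms, next + 1, p, slopLeft - gap)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] phrase = {"this", "is", "my", "cat"};
        String[] exact = "this is my cat".split(" ");
        String[] gapped = "this is not my cat".split(" ");

        System.out.println(exactPhrase(exact, phrase));     // true
        System.out.println(orderedNear(exact, phrase, 0));  // true
        System.out.println(exactPhrase(gapped, phrase));    // false
        System.out.println(orderedNear(gapped, phrase, 0)); // false
        System.out.println(orderedNear(gapped, phrase, 1)); // true
    }
}
```

With slop 0 no position may be skipped, so under this model the in-order near check accepts exactly the same token sequences as the exact phrase check.  The sketch deliberately says nothing about how PhraseQuery treats non-zero slop.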

When the slop is non-zero, I'm not sure what to do.  Does it still
translate directly?  I suspect not, because PhraseQuery slop is
asymmetrical (centred around the term *after* the previous match)
whereas SpanNearQuery slop is symmetrical (centred around the previous
match, although the term to either side is numbered 0 instead of 1 as
one might expect.)

Q1: Is there some way to (precisely) simulate phrase query behaviour in spans?

For boolean queries, it depends... If it's a pure OR query, you can
rewrite like this:

    within(2, 'my', or('cat', 'dog'))
    -> or( within(2, 'my', 'cat'), within(2, 'my', 'dog') )

This doesn't appear to change the semantics of the query.  I notice
there is a SpanOrQuery as well, which I could probably use instead...
but it doesn't seem to make a difference.

For AND queries (and for any "default boolean" queries which aren't
equivalent to OR), I have problems.  For instance, you can't do this:

    within(5, 'my', and('cat', 'dog'))
    -> and( within(5, 'my', 'cat'), within(5, 'my', 'dog') )

The problem is that this changes the semantics - the original query
implies that the same "my" span is used when matching the other two,
whereas the rewritten form allows it to be any "my" in the document.
This problem doesn't exist with OR queries because an OR doesn't have
to match both terms, so a single "my" paired with either term is enough.
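A concrete document makes the difference visible.  The following is a minimal sketch in plain Java over token positions, not Lucene code.  The helper names are hypothetical, and within(n, a, b) is assumed to mean "some occurrence of a and some occurrence of b are at most n positions apart", which is a simplification of real span slop.

```java
import java.util.ArrayList;
import java.util.List;

public class PinnedSpanDemo {

    /** Positions at which `term` occurs in the tokenised document. */
    static List<Integer> positions(String[] doc, String term) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < doc.length; i++) {
            if (doc[i].equals(term)) out.add(i);
        }
        return out;
    }

    /** True if `term` occurs at most `n` positions away from `anchor`. */
    static boolean near(String[] doc, int anchor, String term, int n) {
        for (int p : positions(doc, term)) {
            if (Math.abs(p - anchor) <= n) return true;
        }
        return false;
    }

    /** within(n, a, b): any occurrence of a close to any occurrence of b. */
    static boolean within(String[] doc, int n, String a, String b) {
        for (int p : positions(doc, a)) {
            if (near(doc, p, b, n)) return true;
        }
        return false;
    }

    /** within(n, a, or(b, c)): one occurrence of a is close to b or to c. */
    static boolean withinOr(String[] doc, int n, String a, String b, String c) {
        for (int p : positions(doc, a)) {
            if (near(doc, p, b, n) || near(doc, p, c, n)) return true;
        }
        return false;
    }

    /** within(n, a, and(b, c)): the SAME occurrence of a is close to both b and c. */
    static boolean withinAnd(String[] doc, int n, String a, String b, String c) {
        for (int p : positions(doc, a)) {
            if (near(doc, p, b, n) && near(doc, p, c, n)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // "my" occurs at positions 0 and 10, "cat" at 1, "dog" at 11.
        String[] doc = "my cat sat there and much later on that day my dog barked".split(" ");

        // OR: pinned and distributed forms agree.
        System.out.println(withinOr(doc, 2, "my", "cat", "dog"));                           // true
        System.out.println(within(doc, 2, "my", "cat") || within(doc, 2, "my", "dog"));     // true

        // AND: the distributed form matches using two different "my" tokens.
        System.out.println(withinAnd(doc, 5, "my", "cat", "dog"));                          // false
        System.out.println(within(doc, 5, "my", "cat") && within(doc, 5, "my", "dog"));     // true
    }
}
```

In this document the first "my" is next to "cat" and the second "my" is next to "dog", so the distributed AND matches even though no single "my" is within 5 positions of both terms.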

Q2: Is there some way to "pin this down" such that the "my" matched by
each is the same position?

