lucene-dev mailing list archives

From Adriano Crestani <adrianocrest...@gmail.com>
Subject Re: New flexible query parser
Date Tue, 17 Mar 2009 00:06:43 GMT
Hi everyone,

>> Very interesting. Can this parser solve the Lucene query syntax precedence
>> issues? Would be great to match the current syntax with full precedence
>> support.

Definitely. Actually, the one we have today already has precedence, and to
pass all the Lucene test cases I had to write a processor that removes this
precedence and mimics Lucene's non-precedence behavior. It was as easy as
this: write a processor and insert it into the processor chain. Piece of
cake : )
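
To illustrate the idea, here is a minimal sketch of a processor that is plugged into a chain and flattens a precedence-aware tree. It is not the code from the upcoming patch: Node, NodeProcessor, ProcessorChain and FlattenPrecedenceProcessor are hypothetical names, and the flattening rule (clauses under AND become required, all others optional) only approximates the classic parser for simple cases such as 'a OR b AND c'. It also ignores explicit parenthesized groups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical tree node; not the API of the contributed parser. */
class Node {
    final String label;        // "AND", "OR", "BOOL", or the term text for a leaf
    final boolean required;    // '+' modifier, only meaningful on leaf clauses
    final List<Node> children;

    Node(String label, boolean required, List<Node> children) {
        this.label = label;
        this.required = required;
        this.children = children;
    }

    static Node term(String text) { return new Node(text, false, new ArrayList<>()); }
    static Node op(String label, Node... kids) { return new Node(label, false, Arrays.asList(kids)); }

    boolean isLeaf() { return children.isEmpty(); }

    @Override public String toString() {
        if (isLeaf()) return (required ? "+" : "") + label;
        StringBuilder sb = new StringBuilder(label).append('(');
        for (int i = 0; i < children.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append(children.get(i));
        }
        return sb.append(')').toString();
    }
}

/** One step of the chain: takes a tree and returns a (possibly new) tree. */
interface NodeProcessor { Node process(Node tree); }

/** Runs the registered processors in insertion order. */
class ProcessorChain implements NodeProcessor {
    private final List<NodeProcessor> steps = new ArrayList<>();
    ProcessorChain add(NodeProcessor p) { steps.add(p); return this; }
    public Node process(Node tree) {
        for (NodeProcessor p : steps) tree = p.process(tree);
        return tree;
    }
}

/** Toy rule (an assumption): flatten nested AND/OR into one flat clause list. */
class FlattenPrecedenceProcessor implements NodeProcessor {
    public Node process(Node tree) {
        if (tree.isLeaf()) return tree;
        List<Node> flat = new ArrayList<>();
        collect(tree, false, flat);
        return new Node("BOOL", false, flat);
    }
    private void collect(Node n, boolean required, List<Node> out) {
        if (n.isLeaf()) {
            out.add(new Node(n.label, required, new ArrayList<>()));
            return;
        }
        boolean childRequired = n.label.equals("AND"); // operands of AND are required
        for (Node c : n.children) collect(c, childRequired, out);
    }
}

class PrecedenceDemo {
    public static void main(String[] args) {
        // What a precedence-aware parser might produce for 'a OR b AND c'.
        Node parsed = Node.op("OR", Node.term("a"),
                              Node.op("AND", Node.term("b"), Node.term("c")));
        NodeProcessor chain = new ProcessorChain().add(new FlattenPrecedenceProcessor());
        System.out.println(chain.process(parsed)); // prints BOOL(a +b +c)
    }
}
```

Removing the processor from the chain would leave the precedence-aware tree untouched, which is the kind of on/off switch the chain design makes cheap.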

>> Many of the things that it would be nice to do (perhaps add span support to
>> the standard syntax with an on/off toggle?, etc) are very difficult to build
>> on the current architecture. What you describe indicates these types of
>> things might become easier than they are today.
>>
> Yes, all these things should be much easier to add compared to the old QP.

Yes, as Michael already said, this is much easier with the new architecture.
We would basically need to change the QueryParser so that it supports any new
syntax (if there is any), write one or more processors to handle any new logic
(if there is any), and create a new SpanQueryBuilder that creates SpanQuery
objects instead of regular Query objects. We could then switch between the
SpanQueryBuilder and the regular QueryBuilder whenever we want to generate
SpanQuery or regular Query objects. The point is that this new architecture
is very flexible and incremental, completely different from the one Lucene
has today.
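
As a rough sketch of that switch (again not the actual code): the three stages below are hypothetical interfaces, the builders return strings that merely name Lucene query classes instead of creating real objects, and mapping AND to SpanNearQuery is an assumption made only for illustration. The parser and processor stay the same; only the builder is exchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical tree node: operator name for inner nodes, term text for leaves. */
class QNode {
    final String value;
    final List<QNode> children = new ArrayList<>();
    QNode(String value) { this.value = value; }
}

// The three layers, reduced to one method each (names are assumptions).
interface SyntaxParser { QNode parse(String queryText); }
interface TreeProcessor { QNode process(QNode tree); }
interface TreeBuilder { String build(QNode tree); }

/** Toy syntax: terms joined by the keyword AND, nothing else. */
class ToyAndParser implements SyntaxParser {
    public QNode parse(String queryText) {
        String[] terms = queryText.trim().split("\\s+AND\\s+");
        if (terms.length == 1) return new QNode(terms[0]);
        QNode and = new QNode("AND");
        for (String t : terms) and.children.add(new QNode(t));
        return and;
    }
}

/** Example processor: lowercases every leaf term, keeps operators as they are. */
class LowercaseProcessor implements TreeProcessor {
    public QNode process(QNode tree) {
        if (tree.children.isEmpty()) return new QNode(tree.value.toLowerCase(Locale.ROOT));
        QNode copy = new QNode(tree.value);
        for (QNode c : tree.children) copy.children.add(process(c));
        return copy;
    }
}

/** Describes the regular query that would be built. */
class RegularBuilder implements TreeBuilder {
    public String build(QNode n) {
        if (n.children.isEmpty()) return "TermQuery(" + n.value + ")";
        StringBuilder sb = new StringBuilder("BooleanQuery(");
        for (int i = 0; i < n.children.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append('+').append(build(n.children.get(i)));
        }
        return sb.append(')').toString();
    }
}

/** Describes the span query that would be built from the same tree. */
class SpanBuilder implements TreeBuilder {
    public String build(QNode n) {
        if (n.children.isEmpty()) return "SpanTermQuery(" + n.value + ")";
        StringBuilder sb = new StringBuilder("SpanNearQuery(");
        for (int i = 0; i < n.children.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(build(n.children.get(i)));
        }
        return sb.append(')').toString();
    }
}

/** Wires the layers together: parse, run the processors in order, build. */
class Pipeline {
    private final SyntaxParser parser;
    private final List<TreeProcessor> processors;
    private final TreeBuilder builder;

    Pipeline(SyntaxParser parser, List<TreeProcessor> processors, TreeBuilder builder) {
        this.parser = parser;
        this.processors = processors;
        this.builder = builder;
    }

    String run(String queryText) {
        QNode tree = parser.parse(queryText);
        for (TreeProcessor p : processors) tree = p.process(tree);
        return builder.build(tree);
    }
}

class PipelineDemo {
    public static void main(String[] args) {
        List<TreeProcessor> processors = new ArrayList<>();
        processors.add(new LowercaseProcessor());
        ToyAndParser parser = new ToyAndParser();
        // Same parser and processors, different builder.
        System.out.println(new Pipeline(parser, processors, new RegularBuilder()).run("A AND B"));
        System.out.println(new Pipeline(parser, processors, new SpanBuilder()).run("A AND B"));
    }
}
```

Under these assumptions the first call yields BooleanQuery(+TermQuery(a) +TermQuery(b)) and the second SpanNearQuery(SpanTermQuery(a), SpanTermQuery(b)).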

Best Regards,
Adriano Crestani Campos

On Mon, Mar 16, 2009 at 4:49 PM, Michael Busch <buschmic@gmail.com> wrote:

> On 3/17/09 12:39 AM, Mark Miller wrote:
>
>> Very interesting. Can this parser solve the Lucene query syntax precedence
>> issues? Would be great to match the current syntax with full precedence
>> support.
>>
> Yes. In fact in our product we use a slightly different query syntax. It
> has operator precedence, and also <=, >= syntax for range queries (which
> was wished for in a different thread here...).
>
>> It sounds like a great bit of work to move forward too - I'll be the first
>> to sound in that the current implementation could use improvement, and your
>> implementation sounds great in prose. Would be nice to skim the code though.
>>
> We're preparing a patch - should be ready soon.
>
>> Many of the things that it would be nice to do (perhaps add span support
>> to the standard syntax with an on/off toggle?, etc) are very difficult to
>> build on the current architecture. What you describe indicates these types of
>> things might become easier than they are today.
>>
> Yes, all these things should be much easier to add compared to the old
> QP.
>
>> My vote for contrib would depend on the state of the code - if it passes
>> all the tests and is truly back compat, and is not crazy slower, I don't see
>> why we don't move it in right away depending on confidence levels. That
>> would ensure use and attention that contrib often misses. The old parser
>> could hang around in deprecation.
>>
> I think we can postpone this decision until we have submitted the code and
> gotten some feedback. I personally think this is pretty solid code with good
> unit tests and documentation. So I'd also be fine with adding it to the
> core.
>
>
>
>> - Mark
>>
>> Michael Busch wrote:
>>
>>> Hello,
>>>
>>> in my team at IBM we have used a different query parser than Lucene's in
>>> our products for quite a while. Recently we spent a significant amount
>>> of time in refactoring the code and designing a very generic
>>> architecture, so that this query parser can be easily used for different
>>> products with varying query syntaxes.
>>>
>>> This work was originally driven by Andreas Neumann (who, however, left
>>> our team); most of the code was written by Luis Alves, who has been a
>>> bit active in Lucene in the past, and Adriano Campos, who joined our
>>> team at IBM half a year ago. Adriano is an Apache committer and PMC member
>>> on the Tuscany project and is getting familiar with Lucene now too.
>>>
>>> We think this code is much more flexible and extensible than the current
>>> Lucene query parser, and would therefore like to contribute it to
>>> Lucene. I'd like to give a very brief architecture overview here,
>>> Adriano and Luis can then answer more detailed questions as they're much
>>> more familiar with the code than I am.
>>> The goal was to separate syntax and semantics of a query. E.g. 'a AND
>>> b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query.
>>> We distinguish the semantics of the different query components, e.g.
>>> whether and how to tokenize/lemmatize/normalize the different terms or
>>> which Query objects to create for the terms. We wanted to be able to
>>> write a parser with a new syntax, while reusing the underlying
>>> semantics, as quickly as possible.
>>> In fact, Adriano is currently working on a 100% Lucene-syntax compatible
>>> implementation to make it easy for people who are using Lucene's query
>>> parser to switch.
>>>
>>> The query parser has three layers and its core is what we call the
>>> QueryNodeTree. It is a tree that initially represents the syntax of the
>>> original query, e.g. for 'a AND b':
>>>    AND
>>>   /   \
>>>  A     B
>>>
>>> The three layers are:
>>> 1. QueryParser
>>> 2. QueryNodeProcessor
>>> 3. QueryBuilder
>>>
>>> 1. The upper layer is the parsing layer which simply transforms the
>>> query text string into a QueryNodeTree. Currently our implementations of
>>> this layer use javacc.
>>> 2. The query node processors do most of the work. This layer is in fact a
>>> configurable chain of processors. Each processor can walk the tree and
>>> modify nodes or even the tree's structure. That makes it possible to
>>> e.g. do query optimization before the query is executed or to tokenize
>>> terms.
>>> 3. The third layer is also a configurable chain of builders, which
>>> transform the QueryNodeTree into Lucene Query objects.
>>>
>>> Furthermore the query parser uses flexible configuration objects, which
>>> are based on AttributeSource/Attribute. It also uses message classes that
>>> allow resource bundles to be attached. This makes it possible to translate
>>> messages, which is an important feature of a query parser.
>>>
>>> This design allows us to develop different query syntaxes very quickly.
>>> Adriano wrote the Lucene-compatible syntax in a matter of hours, and the
>>> underlying processors and builders in a few days. We now have a 100%
>>> compatible Lucene query parser, which means the syntax is identical and
>>> all query parser test cases pass on the new one too using a wrapper.
>>>
>>>
>>> Recent posts show that there is demand for query syntax improvements,
>>> e.g. improved range query syntax or operator precedence. There are
>>> already different QP implementations in Lucene+contrib, however I think
>>> we did not keep them all up to date and in sync. This is not too
>>> surprising, because usually when fixes and changes are made to the main
>>> query parser, people don't make the corresponding changes in the contrib
>>> parsers. (I'm guilty here too)
>>> With this new architecture it will be much easier to maintain different
>>> query syntaxes, as the actual code for the first layer is not very much.
>>> All syntaxes would benefit from patches and improvements we make to the
>>> underlying layers, which will make supporting different syntaxes much
>>> more manageable.
>>>
>>> So if there is interest we would like to contribute this work to Lucene.
>>> I think the amount of code (~6K LOC) is higher than in a usual patch,
>>> but also lower than some contrib modules. So I'm not sure if we could
>>> contribute it as a normal patch or maybe a software grant?
>>> We could also maybe think about adding it as a contrib module initially,
>>> and if people like it move it to the core at a later point. I'd actually
>>> prefer this approach over committing to the core directly, as it would
>>> make it easier to make Luis and Adriano contrib committers on the new
>>> module, which of course makes sense as nobody knows the code better than
>>> they do.
>>>
>>> -Michael
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>>
>>>
>>
>>
>
>
>
