lucy-dev mailing list archives

From: "David E. Wheeler"
Subject: Re: [lucy-dev] Simplifying the Query Parser
Date: Sat, 16 Apr 2011 16:01:29 GMT

On Apr 15, 2011, at 10:02 PM, Marvin Humphrey wrote:

>  1. Generate a TermQuery with field 'foo' and term 'bar'.  This is the
>     current behavior, which we are ruling out because it makes it hard to
>     write a secure parser when you have sensitive fields.
>  2. Treat 'foo' as a distinct term, so that the query is parsed the same as
>     'foo bar'.
>  3. Treat 'foo:bar' as a single "leaf", which will then be expanded by
>     Expand_Leaf() and will be tokenized using field-specific Analyzers.
>     Most of the time, this will result in a PhraseQuery, as if you had typed
>     '"foo bar"'.
>  4. Generate a NoMatchQuery.
> Whatever option we choose, I hope that the parser can produce Queries which
> return sensible results for all of these:
>    PHP::Interpreter
>    10:30
> (Can others suggest more torture test query strings?)

Those are great examples. And given those, I think #3 is probably the best choice. In all
those cases, with the possible exception of mailto:, a phrase is what I would expect.
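
As a rough illustration of what option 3 might look like (a sketch only, not Lucy's actual code): the Python below treats the whole string as a single leaf and runs it through a stand-in tokenizer. The names `tokenize` and `expand_leaf` and the tuple return values are hypothetical; a real Expand_Leaf() would use the field-specific Analyzers rather than a regex.

```python
import re


def tokenize(text):
    """Hypothetical stand-in for an Analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", text) if tok]


def expand_leaf(leaf):
    """Option 3 sketch: the colon is just punctuation inside one leaf.

    Returns a tuple describing the query that would be built:
    ("phrase", tokens), ("term", token), or ("nomatch",).
    """
    tokens = tokenize(leaf)
    if not tokens:
        return ("nomatch",)         # nothing survived analysis
    if len(tokens) == 1:
        return ("term", tokens[0])  # a single token needs no phrase
    return ("phrase", tokens)       # same as typing the tokens in quotes


for query in ("foo:bar", "PHP::Interpreter", "10:30"):
    print(query, "->", expand_leaf(query))
```

Under these assumptions, each of the quoted test strings comes out as a two-token phrase, which matches what I would expect as a user.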

> Our QueryParser, unlike the Lucene QueryParser, is primarily designed as a
> user-facing parser -- it never throws parse errors, it supports only widely
> popular syntax, etc.  Options 2 and 3 are similar to what you get at Google
> today[1], and they are in the tolerant spirit of the current design.
> However, they are somewhat inconsistent from an interface design standpoint,
> and I worry that that makes QueryParser harder to grok and subclass.

This is largely a matter of precise documentation and a good API, though, yes? Also, is there
a strict, Lucene-style parser?

>  0.2.0
>    * Always heed colons.
>    * Make QParser_Set_Heed_Colons() a no-op and deprecate it in the
>      documentation.
>  0.3.0
>    * Remove QParser_Set_Heed_Colons().

And at what point would one of the above four solutions be applied? I can see arguments for
0.1.0 and 0.2.0.
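
For reference, the 0.2.0 step in the quoted schedule (keep the setter, but make it a no-op and deprecate it) could look roughly like the following. This is a Python sketch under assumed names, not Lucy's C API: the `QueryParser` class, its `heed_colons` attribute, and `set_heed_colons` are all hypothetical stand-ins for QParser_Set_Heed_Colons().

```python
import warnings


class QueryParser:
    """Hypothetical parser illustrating a deprecated no-op setter."""

    def __init__(self):
        # Under the quoted 0.2.0 plan, colons are always heeded.
        self.heed_colons = True

    def set_heed_colons(self, value):
        """Deprecated: accepted for compatibility, but changes nothing."""
        warnings.warn(
            "set_heed_colons() is deprecated and has no effect",
            DeprecationWarning,
            stacklevel=2,
        )
        # Intentionally ignore `value`; the setting can no longer be changed.


parser = QueryParser()
parser.set_heed_colons(False)
print(parser.heed_colons)
```

Existing callers keep working through the deprecation release, and the method can then be removed outright in the following one.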
