lucy-dev mailing list archives

From Nathan Kurz <n...@verse.com>
Subject Re: [lucy-dev] Simplifying the Query Parser
Date Mon, 18 Apr 2011 20:48:56 GMT
On Sat, Apr 16, 2011 at 9:01 AM, David E. Wheeler <david@kineticode.com> wrote:
> On Apr 15, 2011, at 10:02 PM, Marvin Humphrey wrote:
>
>>  1. Generate a TermQuery with field 'foo' and term 'bar'.  This is the
>>     current behavior, which we are ruling out because it makes it hard to
>>     write a secure parser when you have sensitive fields.
>>  2. Treat 'foo' as a distinct term, so that the query is parsed the same as
>>     'foo bar'.
>>  3. Treat 'foo:bar' as a single "leaf", which will then be expanded by
>>     Expand_Leaf() and will be tokenized using field-specific Analyzers.
>>     Most of the time, this will result in a PhraseQuery, as if you had typed
>>     '"foo bar"'.
>>  4. Generate a NoMatchQuery.
>>
>> Whatever option we choose, I hope that the parser can produce Queries which
>> return sensible results for all of these:
>>
>>    http://www.apache.org/
>>    mailto:me@example.com
>>    PHP::Interpreter
>>    10:30
>>
>> (Can others suggest more torture test query strings?)
>
> Those are great examples. And given those, I think #3 is probably the best
> choice. In all those cases, with the possible exception of mailto:, a phrase
> is what I would expect.

This seems simplest to me:  Outside of quotes, QueryParser breaks on
whitespace and everything else is handled by a lower level.
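To make that concrete, here is a minimal sketch in Python rather than Lucy's C or Perl. The function name split_leaves is made up for illustration and is not part of any Lucy API. It only shows the "break on whitespace outside quotes" rule; interpreting each leaf is left to whatever sits below it.

```python
def split_leaves(query):
    """Split a query string on whitespace that falls outside double quotes.

    Hypothetical helper, not a Lucy function. Quoted phrases are kept
    whole, quotes included, so a lower level can decide what they mean.
    """
    leaves = []
    current = []
    in_quotes = False
    for ch in query:
        if ch == '"':
            # Toggle quote state and keep the quote character in the leaf.
            in_quotes = not in_quotes
            current.append(ch)
        elif ch.isspace() and not in_quotes:
            # Whitespace outside quotes ends the current leaf, if any.
            if current:
                leaves.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        leaves.append("".join(current))
    return leaves
```

With this rule, 'foo:bar', 'http://www.apache.org/', 'PHP::Interpreter' and '10:30' each come through as a single untouched leaf, and the colon question is pushed down to the leaf handler.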

I also agree that the goal is to make the C-level parser able to
handle all the needs for 80% of the use cases.  I'd further suggest
the goal of making it possible to handle 98% of the needs at the
scripting-language level without needing to write any C.   Then if
there turn out to be cases that are very popular, we can back-port them
to a subclassed C-level QueryParser so that all the scripting languages
can just bind to that.

Where I disagree is how 80% is being defined.  I'd guess that 80% of
users aren't going to care about whether we support explicit fields in
queries, and that of the 20% who do, only a tiny number will care
about preventing searches on hidden fields.  Do you really see
supporting parenthesized Boolean expressions involving a mixture of
secure and insecure explicit field names as being a majority use case?

Presuming we've eliminated the "strict parser, transform your query as
text" approach, I think there are two separate routes to follow:  we
either want a complex but configurable QueryParser that meets almost
all needs, with features being added to it over time; or we have an
extremely simple QueryParser that is designed for future extension.
It feels like right now we have an unhealthy blend of the two.  It
feels like we are saying we want simple, but pursuing something very
complex.

Although I can see the appeal of the configurable approach, I'm in the
"simple means simple, shut up and eat your dogfood" camp.   Instead
of optional parameters, I'd rather see easy-to-create special-purpose
subclasses.   I see Set_Heed_Colons as a code smell,
"default_boolop" as clumsy, and "fields" as over-engineering.

I'd love to see a base query parser that does nothing but handle
quotes and keywords, and then a scripting level extension that handles
query fields.  Then write another one that adds in '+' and '-' flags,
and figure out how to merge them.  Then rather than incorporating
these changes into the supposedly simple base parser, I want a
subclassed FieldQueryParser that adds this functionality without
requiring a cut-and-paste of the entire base class.
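A rough sketch of the shape I have in mind, in Python for brevity. Everything here is hypothetical: BaseQueryParser, its split and expand_leaf methods, and the tuple format are invented for illustration and do not reflect Lucy's actual classes. The point is only that the field-aware subclass overrides one hook instead of copying the base.

```python
class BaseQueryParser:
    """Hypothetical base parser: quotes and keywords only, no fields."""

    def parse(self, query):
        return [self.expand_leaf(leaf) for leaf in self.split(query)]

    def split(self, query):
        # Break on whitespace outside double quotes; keep quotes in the leaf.
        leaves, current, in_quotes = [], "", False
        for ch in query:
            if ch == '"':
                in_quotes = not in_quotes
            if ch.isspace() and not in_quotes:
                if current:
                    leaves.append(current)
                current = ""
            else:
                current += ch
        if current:
            leaves.append(current)
        return leaves

    def expand_leaf(self, leaf):
        # The single hook subclasses override. Returns (kind, field, text).
        if len(leaf) >= 2 and leaf.startswith('"') and leaf.endswith('"'):
            return ("phrase", None, leaf[1:-1])
        return ("term", None, leaf)


class FieldQueryParser(BaseQueryParser):
    """Hypothetical subclass: adds 'field:text' for an allowed field list."""

    def __init__(self, fields):
        self.fields = set(fields)

    def expand_leaf(self, leaf):
        name, sep, rest = leaf.partition(":")
        if sep and rest and name in self.fields:
            kind, _, text = super().expand_leaf(rest)
            return (kind, name, text)
        # Unknown prefix: fall back to the base behavior (one possible policy).
        return super().expand_leaf(leaf)
```

Under these assumptions, FieldQueryParser(["title"]).parse('title:foo secret:bar') treats 'title:foo' as a fielded term and leaves 'secret:bar' as an ordinary leaf. That fallback is just one of the options discussed earlier in the thread, not a settled choice.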

If we can do these two easily (scripting level extension and core
subclassing) I'd be confident that we have a promising architecture
that we can build on.  If it turns out that there's no easy way to do
this, we need to revise the base until it can be done.  Then we move
on to solving slightly harder issues like Peter's
"module:PHP::Interpreter" interpretation and how non-existent field
names should be treated.  If it's truly a security question, I think
we need to solve it at the Matcher side rather than at query creation.

My personal litmus test would be how easy it is to make the search
[this-query] expand out to something like [("this query"
WITH BOOST 10)  OR (thisquery WITH BOOST 5) OR (this AND query WITH
PROXIMITY WEIGHTING)].   I'm happy to do this as a text transform that
is fed into a core-distributed strict parsing class, or to do this as
a Perl extension that I can later rewrite as a C extension.  But I'd
rather not have to rewrite a parser from scratch to accomplish this.
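For the text-transform route, the rewrite itself is small. Here is a sketch in Python; expand_hyphenated is a made-up name, and the WITH BOOST / WITH PROXIMITY WEIGHTING output is the pseudo-syntax from my example above, not a query language that any existing parser accepts.

```python
def expand_hyphenated(leaf):
    """Rewrite a hyphenated leaf into three OR'ed alternatives.

    Hypothetical helper. The output uses the illustrative pseudo-syntax
    from this message and would need a strict parser that understands it.
    """
    parts = [p for p in leaf.split("-") if p]
    if len(parts) < 2:
        # Nothing to expand: pass the leaf through unchanged.
        return leaf
    phrase = '"%s"' % " ".join(parts)   # exact phrase, highest boost
    joined = "".join(parts)             # run together as one word
    conj = " AND ".join(parts)          # all words present, proximity-weighted
    return (
        "(%s WITH BOOST 10) OR (%s WITH BOOST 5) OR "
        "(%s WITH PROXIMITY WEIGHTING)" % (phrase, joined, conj)
    )
```

The hard part is not this function but having somewhere to plug it in: either a strict parser that accepts the expanded text, or a per-leaf hook in the base parser.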

--nate
