incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Simplifying the Query Parser
Date Sat, 16 Apr 2011 05:02:18 GMT
On Fri, Apr 15, 2011 at 03:41:22PM -0700, David E. Wheeler wrote:
> * If you search for "foo:bar" and the field "foo" doesn't exist or is not
>   public, treat it as a term.

I certainly think we need to tighten up the 'foo:bar' behavior so that field
names will only be allowed if they are in the whitelist set at QueryParser
construction time by the 'fields' parameter:

     * @param fields The names of the fields which will be searched against.
     * Defaults to those fields which are defined as indexed in the supplied
     * Schema.

The question then becomes what we do with the query string 'foo:bar' when
'foo' isn't in the whitelist.  I can think of four possible options, one of which
we are ruling out.

  1. Generate a TermQuery with field 'foo' and term 'bar'.  This is the
     current behavior, which we are ruling out because it makes it hard to
     write a secure parser when you have sensitive fields.
  2. Treat 'foo' as a distinct term, so that the query is parsed the same as
     'foo bar'.
  3. Treat 'foo:bar' as a single "leaf", which will then be expanded by
     Expand_Leaf() and will be tokenized using field-specific Analyzers.
     Most of the time, this will result in a PhraseQuery, as if you had typed
     '"foo bar"'.
  4. Generate a NoMatchQuery.
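
To make options 2 through 4 concrete, here is a toy sketch in Python.  It is
not Lucy code: the function name parse_field_clause, the tuple-shaped
"queries", and the 'fallback' argument are all invented for illustration, and
option 3 only approximates what Expand_Leaf() and an Analyzer would do by
splitting on non-word characters.

```python
import re


def parse_field_clause(text, allowed_fields, fallback="phrase"):
    """Toy model of handling 'field:value' against a whitelist.

    Hypothetical helper, not part of Lucy.  Queries are plain tuples.
    """
    field, sep, value = text.partition(":")
    if not sep:
        # No colon at all: an ordinary leaf.
        return ("leaf", text)
    if field in allowed_fields:
        # Whitelisted field: honor the field qualifier.
        return ("term", field, value)
    if fallback == "terms":
        # Option 2: parse the same as 'foo bar'.
        return ("terms", [field, value])
    if fallback == "phrase":
        # Option 3: keep the whole string as one leaf.  A crude regex
        # tokenizer stands in for a real field-specific Analyzer.
        return ("phrase", re.findall(r"\w+", text))
    if fallback == "nomatch":
        # Option 4: the clause can never match anything.
        return ("nomatch",)
    raise ValueError("unknown fallback: %r" % (fallback,))


if __name__ == "__main__":
    for mode in ("terms", "phrase", "nomatch"):
        print(mode, parse_field_clause("foo:bar", {"title"}, mode))
```

Under these assumptions, 'title:bar' still yields a fielded term, while
'foo:bar' falls through to whichever fallback is selected.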

Whatever option we choose, I hope that the parser can produce Queries which
return sensible results for all of these:

    http://www.apache.org/
    mailto:me@example.com
    PHP::Interpreter
    10:30

(Can others suggest more torture test query strings?)
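
One way to see why these strings are awkward: a naive parser that splits at
the first colon would read a "field" off the front of each of them.  A small
Python sketch (the helper apparent_field is made up for illustration and is
not how Lucy's lexer works):

```python
TORTURE_STRINGS = [
    "http://www.apache.org/",
    "mailto:me@example.com",
    "PHP::Interpreter",
    "10:30",
]


def apparent_field(text):
    """Return whatever precedes the first colon, or None if there is none.

    Hypothetical helper: mimics a naive 'field:value' split.
    """
    field, sep, _rest = text.partition(":")
    return field if sep else None


if __name__ == "__main__":
    for query_string in TORTURE_STRINGS:
        # Prints the would-be field names: http, mailto, PHP, 10
        print("%-24s -> %s" % (query_string, apparent_field(query_string)))
```

None of 'http', 'mailto', 'PHP' or '10' is likely to be a whitelisted field,
so every one of these strings would hit the fallback path.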

Option 4 -- generate a NoMatchQuery -- is obviously the right choice for a
strict parser, but I'm not sure it's right for the Lucy QueryParser.  I just
don't see how it can be made to work with 'mailto:me@example.com'.

Our QueryParser, unlike the Lucene QueryParser, is primarily designed as a
user-facing parser -- it never throws parse errors, it supports only widely
popular syntax, etc.  Options 2 and 3 are similar to what you get at Google
today[1], and they are in the tolerant spirit of the current design.
However, they are somewhat inconsistent from an interface design standpoint,
and I worry that that makes QueryParser harder to grok and subclass.

> 1. Remove complexity (or at least deprecate it)

Regarding deprecation:

Building consensus now as to what the design should be is cool.  Though we
don't have an official feature freeze in place, IMO implementing it should
wait until after 0.1.0.  So we're talking about a change targeted for 0.2.0.

Since Lucy is officially API-unstable, we have the freedom to change things
right away, and I think that's desirable in this case.  The one caveat: if we
can make it possible to upgrade Lucy across one minor version underneath a
live app with no downtime, we should.  Here's what I'd suggest:

  0.2.0
    * Always heed colons.
    * Make QParser_Set_Heed_Colons() a no-op and deprecate it in the
      documentation.
  0.3.0
    * Remove QParser_Set_Heed_Colons().
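
The 0.2.0 step might look roughly like this.  This is a Python sketch of the
pattern only, not Lucy's C implementation; the class ToyQueryParser and its
method names are invented stand-ins for QueryParser and
QParser_Set_Heed_Colons().

```python
import warnings


class ToyQueryParser:
    """Hypothetical stand-in for QueryParser during the 0.2.0 window."""

    def __init__(self):
        # As of "0.2.0" colons are always heeded.
        self.heed_colons = True

    def set_heed_colons(self, heed_colons):
        """Deprecated no-op: accepted so existing callers keep working."""
        warnings.warn(
            "set_heed_colons() is a no-op and will be removed",
            DeprecationWarning,
            stacklevel=2,
        )
        # Deliberately ignore the argument; the setting is fixed at True.


if __name__ == "__main__":
    parser = ToyQueryParser()
    parser.set_heed_colons(False)  # old app code still runs
    print(parser.heed_colons)
```

Because the call still exists and merely does nothing, an app written against
the old version keeps running across the upgrade; the call only starts to fail
once the method is removed in the following release.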

Marvin Humphrey

[1] Compare searches for 'foo:bar' <http://www.google.com/search?q=foo%3Abar>
    and 'define:bar' <http://www.google.com/search?q=define%3Abar>.

