Mailing-List: contact lucy-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: lucy-dev@incubator.apache.org
Received-SPF: pass (nike.apache.org: local policy)
Date: Mon, 18 Apr 2011 12:41:52 -0700
From: Marvin Humphrey <marvin@rectangular.com>
To: lucy-dev@incubator.apache.org
Message-ID: <20110418194152.GB14039@rectangular.com>
References: <D3E8E7A4-36DA-431F-847C-D2391FDDB637@kineticode.com>
 <20110416050218.GA30118@rectangular.com>
 <65A25C24-FAED-433F-B565-A826737B4972@kineticode.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <65A25C24-FAED-433F-B565-A826737B4972@kineticode.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
Subject: Re: [lucy-dev] Simplifying the Query Parser

On Sat, Apr 16, 2011 at 09:01:29AM -0700, David E. Wheeler wrote:
> On Apr 15, 2011, at 10:02 PM, Marvin Humphrey wrote:
> 
> >  1. Generate a TermQuery with field 'foo' and term 'bar'.  This is the
> >     current behavior, which we are ruling out because it makes it hard to
> >     write a secure parser when you have sensitive fields.
> >  2. Treat 'foo' as a distinct term, so that the query is parsed the same as
> >     'foo bar'.
> >  3. Treat 'foo:bar' as a single "leaf", which will then be expanded by
> >     Expand_Leaf() and will be tokenized using field-specific Analyzers.
> >     Most of the time, this will result in a PhraseQuery, as if you had typed
> >     '"foo bar"'.
> >  4. Generate a NoMatchQuery.
> > 
> > Whatever option we choose, I hope that the parser can produce Queries which
> > return sensible results for all of these:
> > 
> >    http://www.apache.org/
> >    mailto:me@example.com
> >    PHP::Interpreter
> >    10:30
> > 
> > (Can others suggest more torture test query strings?)
> 
> Those are great examples. And given those, I think #3 is probably the best
> choice. In all those cases, with the possible exception of mailto:, a phrase
> is what I would expect.

The 'mailto:' example is the only one that's problematic with #4, the
NoMatchQuery option, though.  And it's pretty esoteric -- how many times do
people actually type/paste mailto links into search boxes?

The others are all solvable by tightening up the parser:

  * Currently field names must match /[0-9A-Za-z_]+/.  We should require
    them to be identifiers, i.e. they must not start with a number:
    /[A-Za-z_][0-9A-Za-z_]*/
  * QueryParser should use single-token lookahead to enforce that field name
    constructs must be followed by something sensible.

Schema currently allows field names to be anything at all.  (We probably ought
to tighten up the rules there, too, though not necessarily to require
identifiers.)  However, it doesn't bother me if QueryParser is more finicky.

I think we ought to consider the fact that QueryParser turns
'PHP::Interpreter' into field name 'PHP' and term ':Interpreter' a bug, and
address it with bugfixes rather than a new feature.  The behavior after
applying the fixes will be to produce a phrase -- just as with option #3 --
but the internals of QueryParser will be cleaner.

> > Our QueryParser, unlike the Lucene QueryParser, is primarily designed as a
> > user-facing parser -- it never throws parse errors, it supports only widely
> > popular syntax, etc.  Options 2 and 3 are similar to what you get at Google
> > today[1], and they are in the tolerant spirit of the current design.
> > However, they are somewhat inconsistent from an interface design standpoint,
> > and I worry that that makes QueryParser harder to grok and subclass.
> 
> This is largely a matter of precise documentation and a good API, though,
> yes? 

I wouldn't say so.

First, it's a magical, DWIM behavior.  Writing a subclassed QueryParser
becomes harder when you have to think about what happens when what is supposed
to be a field name suddenly becomes a term.

Second, although we haven't formally defined a BNF grammar for our
QueryParser, we could.  So long as field name constructs are fully determined
by syntax, it's relatively straightforward to reimplement our query parser
language using a parser generator like Lemon or Parse::RecDescent.  However,
if we introduce this fuzzy behavior, such reimplementations get more complex
to write and harder to hack on because the parser has to reach back into the
OO model.

> Also, is there a strict, Lucene-style parser?

Lucy doesn't provide one.  The OO interfaces for our Query offerings largely
fill the need for formal query definition and do a better job of it than a
language descended from search-box syntax ever could.  

> >  0.2.0
> >    * Always heed colons.
> >    * Make QParser_Set_Heed_Colons() a no-op and deprecate it in the
> >      documentation.
> >  0.3.0
> >    * Remove QParser_Set_Heed_Colons().
> 
> And at what point would the application of one of the above four solutions
> be applied? I can see arguments for 0.1.0 and 0.2.0.

This issue should not block 0.1.0, which is almost done.  

We'd probably be voting on 0.1.0 now were it not for the interesting traffic
on lucy-dev diverting our attention. :)

Marvin Humphrey