Return-Path: Delivered-To: apmail-incubator-lucy-dev-archive@www.apache.org Received: (qmail 34530 invoked from network); 18 Apr 2011 19:46:59 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 18 Apr 2011 19:46:59 -0000 Received: (qmail 53784 invoked by uid 500); 18 Apr 2011 19:46:59 -0000 Delivered-To: apmail-incubator-lucy-dev-archive@incubator.apache.org Received: (qmail 53758 invoked by uid 500); 18 Apr 2011 19:46:59 -0000 Mailing-List: contact lucy-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: lucy-dev@incubator.apache.org Delivered-To: mailing list lucy-dev@incubator.apache.org Received: (qmail 53750 invoked by uid 99); 18 Apr 2011 19:46:59 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 18 Apr 2011 19:46:59 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=5.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [68.116.39.62] (HELO rectangular.com) (68.116.39.62) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 18 Apr 2011 19:46:51 +0000 Received: from marvin by rectangular.com with local (Exim 4.69) (envelope-from ) id 1QBuKW-0003gM-Hw for lucy-dev@incubator.apache.org; Mon, 18 Apr 2011 12:41:52 -0700 Date: Mon, 18 Apr 2011 12:41:52 -0700 From: Marvin Humphrey To: lucy-dev@incubator.apache.org Message-ID: <20110418194152.GB14039@rectangular.com> References: <20110416050218.GA30118@rectangular.com> <65A25C24-FAED-433F-B565-A826737B4972@kineticode.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <65A25C24-FAED-433F-B565-A826737B4972@kineticode.com> User-Agent: Mutt/1.5.18 (2008-05-17) X-Virus-Checked: Checked by ClamAV on apache.org Subject: Re: [lucy-dev] Simplifying the Query Parser On Sat, Apr 16, 2011 at 09:01:29AM -0700, David E. Wheeler wrote: > On Apr 15, 2011, at 10:02 PM, Marvin Humphrey wrote: > > > 1. Generate a TermQuery with field 'foo' and term 'bar'. This is the > > current behavior, which we are ruling out because it makes it hard to > > write a secure parser when you have sensitive fields. > > 2. Treat 'foo' as a distinct term, so that the query is parsed the same as > > 'foo bar'. > > 3. Treat 'foo:bar' as a single "leaf", which will then be expanded by > > Expand_Leaf() and will be tokenized using field-specific Analyzers. > > Most of the time, this will result in a PhraseQuery, as if you had typed > > '"foo bar"'. > > 4. Generate a NoMatchQuery. > > > > Whatever option we choose, I hope that the parser can produce Queries which > > return sensible results for all of these: > > > > http://www.apache.org/ > > mailto:me@example.com > > PHP::Interpreter > > 10:30 > > > > (Can others suggest more torture test query strings?) > > Those are great examples. And given those, I think #3 is probably the best > choice. In all those cases, with the possible exception of mailto:, a phrase > is what I would expect. The 'mailto:' example is the only one that's problematic with #4, the NoMatchQuery option, though. And it's pretty esoteric -- how many times do people actually type/paste mailto links into search boxes? The others are all solvable by tightening up the parser: * Currently field names must match /[0-9A-Za-z_]+/. We should require them to be identifiers, i.e. they must not start with a number: /[A-Za-z_][0-9A-Za-z_]*/ * QueryParser should use single-token lookahead to enforce that field name constructs must be followed by something sensible. Schema currently allows field names to be anything at all. (We probably ought to tighten up the rules there, too, though not necessarily to require identifiers.) However, it doesn't bother me if QueryParser is more finicky. I think we ought to consider the fact that QueryParser turns 'PHP::Interpreter' into field name 'PHP' and term ':Interpreter' a bug, and address it with bugfixes rather than a new feature. The behavior after applying the fixes will be to produce a phrase -- just as with option #3 -- but the internals of QueryParser will be cleaner. > > Our QueryParser, unlike the Lucene QueryParser, is primarily designed as a > > user-facing parser -- it never throws parse errors, it supports only widely > > popular syntax, etc. Options 2 and 3 are similar to what you get at Google > > today[1], and they are in the tolerant spirit of the current design. > > However, they are somewhat inconsistent from an interface design standpoint, > > and I worry that that makes QueryParser harder to grok and subclass. > > This is largely a matter of precise documentation and a good API, though, > yes? I wouldn't say so. First, it's a magical, DWIM behavior. Writing a subclassed QueryParser becomes harder when you have to think about what happens when what is supposed to be a field name suddenly becomes a term. Second, although we haven't formally defined a BNF grammar for our QueryParser, we could. So long as field name constructs are fully determined by syntax, it's relatively straightforward to reimplement our query parser language using a parser generator like Lemon or Parse::RecDescent. However, if we introduce this fuzzy behavior, such reimplementations get more complex to write and harder to hack on because the parser has to reach back into the OO model. > Also, is there a strict, Lucene-style parser? Lucy doesn't provide one. The OO interfaces for our Query offerings largely fill the need for formal query definition and do a better job of it than a language descended from search-box syntax ever could. > > 0.2.0 > > * Always heed colons. > > * Make QParser_Set_Heed_Colons() a no-op and deprecate it in the > > documentation. > > 0.3.0 > > * Remove QParser_Set_Heed_Colons(). > > And at what point would the application of one of the above four solutions > be applied? I can see arguments for 0.1.0 and 0.2.0. This issue should not block 0.1.0, which is almost done. We'd probably be voting on 0.1.0 now were it not for the interesting traffic on lucy-dev diverting our attention. :) Marvin Humphrey