lucy-dev mailing list archives

From "Mattmann, Chris A (388J)" <chris.a.mattm...@jpl.nasa.gov>
Subject Re: [lucy-dev] Simplifying the Query Parser
Date Sat, 16 Apr 2011 04:36:51 GMT
+1 from me...

Cheers,
Chris

On Apr 15, 2011, at 3:41 PM, David E. Wheeler wrote:

> Lucyans,
> 
> Marvin and I were just discussing the QueryParser on IRC. Years ago, I reported a bug in the KinoSearch query parser:
> 
>  http://www.rectangular.com/pipermail/kinosearch/2006-May/004992.html
> 
> Basically, if I searched on "PHP::Interpreter", the parser died. Marvin fixed this bug and, I think partly as a result, introduced the `heed_colons` attribute that persists today in Lucy::Search::QueryParser. But as I understand it, `heed_colons` has three issues:
> 
> 1. It adds complexity to the parser (simpler is better).
> 2. It has a security vulnerability: If a user searches on "secret_field:foo", it will search only secret_field, and you might not want that.
> 3. If a field doesn't exist, the results may be meaningless.
> 
> In discussing these issues with Marvin, he expressed a strong desire not to get into QueryParser wars, and I can understand that. I think that one of the strengths of Lucy is that the default QueryParser offers a decent 80% solution for most users, while offering toolkit hackers the power to do even more. With that in mind, I think we've come up with a solution to the above issues that actually *simplifies* QP a bit:
> 
> * Deprecate heed_colons. Always heed colons.
> * If you search for "foo:bar" and the field "foo" doesn't exist or is not public, treat it as a term.
> 
> So addressing the above three points, this change would:
> 
> 1. Remove complexity (or at least deprecate it)
> 2. Prevent private fields from being searched
> 3. Return relevant results when a colon term does not match a public field.
> 
> As a result, "module:PHP::Interpreter" will properly search "PHP OR Interpreter IN module", "PHP::Interpreter" will search "PHP OR Interpreter", and "secret_field:whatever" will search "secret OR field OR whatever".
> 
> Thoughts?
> 
> Best,
> 
> David
> 
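
The fallback rule proposed in the quoted message can be illustrated with a minimal sketch. This is Python, not Lucy's actual implementation; `tokenize`, `parse_clause`, and `public_fields` are hypothetical names, and the tokenizer (split on anything that is not a letter or digit) is an assumption chosen only to reproduce the three examples in the quote.

```python
import re


def tokenize(text):
    """Stand-in analyzer (an assumption): split on any non-alphanumeric run."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def parse_clause(clause, public_fields):
    """Return (field, terms). field is None when the clause is plain text.

    A "foo:bar" clause is treated as a field query only when "foo" is a
    known, public field; otherwise the whole clause is tokenized as terms.
    """
    prefix, sep, rest = clause.partition(":")
    if sep and rest and prefix in public_fields:
        return prefix, tokenize(rest)
    # No colon, unknown field, or non-public field: fall back to plain terms.
    return None, tokenize(clause)


public_fields = {"module"}  # hypothetical schema; "secret_field" is not public
print(parse_clause("module:PHP::Interpreter", public_fields))
print(parse_clause("PHP::Interpreter", public_fields))
print(parse_clause("secret_field:whatever", public_fields))
```

The terms in each returned list would then be OR'd together, restricted to the returned field when it is not `None`.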


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattmann@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

