incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Re: [KinoSearch] Stopwords and AND queries
Date Thu, 16 Dec 2010 23:48:33 GMT
On Thu, Dec 16, 2010 at 12:04:30PM -0800, Nathan Kurz wrote:
> I've only glanced at this, but neither NULL nor a VoidQuery really
> seems like the actual solution here.  

Thanks for the high-level perspective, Nate.  In light of your comments, I
definitely feel that this merits a targeted, non-public bugfix and that adding
a new VoidQuery API would be inappropriate.

> If I search for "The Smiths", it means I'm searching for that term.
> If "The" is stop listed, there is simply no way to answer a query that
> uses it with anything but an error message.  

True, but we'll have to give them broken results anyway. :(  It's better to
provide the results for "Smiths" than no results at all.

> It seems like there also needs to be treatment at the QueryParser level to
> reject or modify queries that attempt to use stop terms.  

Point taken.  I think that would be a useful feature for a query parser.

However, the current QueryParser's design does not allow for it.  QueryParser
is not aware of stop words -- it only knows about Analyzers.  Processing the
removal of stop words at the QueryParser level (so that QueryParser could
report which stop words were removed) would require piercing at least three
levels of encapsulation.

It wouldn't be hard to build a custom query parser that processed stop words
early on, adapting the recipe in Lucy::Docs::Cookbook::CustomQueryParser.  I
think that needs to be the answer.
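
As a rough illustration of "processing stop words early", here is a minimal
Python sketch of a front-end step that strips stop words from a plain list of
terms and reports which ones it removed.  It is not Lucy or KinoSearch code and
does not follow the cookbook recipe: the STOPLIST contents and the
strip_stopwords() helper are hypothetical, and it assumes a query with no
phrase or boolean syntax.

```python
import re

# Hypothetical stoplist, for illustration only.
STOPLIST = frozenset({"the", "a", "an", "of"})


def strip_stopwords(query, stoplist=STOPLIST):
    """Split a plain query into (kept_terms, removed_terms).

    Assumes bare terms only: no phrase quoting, no AND/OR operators.
    """
    kept, removed = [], []
    for word in re.findall(r"\w+", query):
        # Compare case-insensitively, but report the word as the user typed it.
        if word.lower() in stoplist:
            removed.append(word)
        else:
            kept.append(word)
    return kept, removed


kept, removed = strip_stopwords("The Smiths")
if removed:
    # The front end can tell the user what was dropped before searching.
    print("Ignored stop words:", ", ".join(removed))
print("Searching for:", " ".join(kept))
```

The point of doing this before the query reaches the real parser is that the
front end still knows which words were dropped and can say so.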

> More generally, it seems like Stop Lists themselves should be discouraged as
> a shortcut from earlier times when disk storage was at a premium.

We've been discouraging stoplists ever since KinoSearch 0.05.  Unlike Lucene's
"StandardAnalyzer", our PolyAnalyzer does not remove stopwords by default.  I
used '"the smiths"' and '"the who"' to illustrate why stoplists suck in my
2006 OSCON presentation.

Nevertheless, index size is still a major concern, and stoplists have their
place.  Disk is cheap, but RAM is expensive -- and if you want to run search
clusters under very heavy load, you need indexes that can fit into the OS
cache.  Stoplists can help with that.

> Which is to say, I think the current behaviour is correct.  If you
> manage to get a query through asking for a stop listed term, the
> answer is that it is not there, whether in a phrase or an AND. Courtesy
> says that you would return an error message or correct the query, but
> this should be handled by the front end and not by the index proper.

I agree that this would be the best behavior.  It can be achieved by
subclassing QueryParser and overriding Expand_Leaf().
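
To make the shape of that override concrete, here is a toy Python sketch.  The
ToyQueryParser class is a stand-in I made up for illustration, not Lucy's
QueryParser, and its expand_leaf() hook only imitates the idea of a per-leaf
expansion step; StopwordError and StrictQueryParser are likewise hypothetical
names.

```python
class StopwordError(ValueError):
    """Raised when a query asks for a stop-listed term."""


class ToyQueryParser:
    """Stand-in parser: splits on whitespace and expands each leaf."""

    def parse(self, query_string):
        return [self.expand_leaf(leaf) for leaf in query_string.split()]

    def expand_leaf(self, leaf):
        # Default expansion: just normalize case.
        return leaf.lower()


class StrictQueryParser(ToyQueryParser):
    """Subclass that rejects stop-listed leaves instead of dropping them."""

    def __init__(self, stoplist):
        self.stoplist = frozenset(word.lower() for word in stoplist)

    def expand_leaf(self, leaf):
        if leaf.lower() in self.stoplist:
            raise StopwordError(f"'{leaf}' is a stop word and cannot be searched")
        return super().expand_leaf(leaf)


parser = StrictQueryParser({"the"})
try:
    parser.parse("The Smiths")
except StopwordError as err:
    # The front end decides what to do: show the error or retry without it.
    print("Query rejected:", err)
```

Raising from the leaf hook keeps the decision in the front end, which can
either show the message or retry with the offending term removed.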

> ps.   If you still feel you need to act, I think you need something
> like a static StopTerm and to allow the Boolean query classes to
> decide how they want to treat this.  But I'd recommend against adding
> this complexity unless you're certain it's a real problem that can't
> be handled as interface.

I would oppose adding such complexity elsewhere for the sake of stoplists.
Stoplists are a lousy band-aid to begin with.  It would be a poor engineering
compromise to saddle crucial classes like ANDQuery with special-case code when
you're still never going to be able to search for '"the who"'.

Marvin Humphrey

