incubator-lucy-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-dev] Re: [KinoSearch] Stopwords and AND queries
Date Wed, 15 Dec 2010 20:08:26 GMT
On Wed, Dec 15, 2010 at 04:28:54PM +0100, Nick Wellnhofer wrote:
> I only had a cursory glance at the code, and it seems that returning
> NULL is the easiest approach, though it looks a bit hackish.
> Introducing a new VoidQuery class is probably the cleanest solution,
> but I guess it requires a lot more additional code.

I wouldn't mind adding VoidQuery if I thought it would be useful outside of
this context.  But I don't think it will.

There's another way: we can add a "fails_to_match" attribute to NoMatchQuery.
It would default to true, preserving NoMatchQuery's current behavior.
However, we can have QParser_Expand_Leaf() set "fails_to_match" to false,
allowing QParser_Expand() to know when it can safely drop a clause from an
ANDQuery.
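
Here's a rough sketch of the idea in C.  The struct layout, the helper names,
and the signatures are simplified stand-ins of my own for illustration -- not
Lucy's actual Clownfish-generated API:

    /* Illustrative sketch only: simplified stand-ins for Lucy's Query
     * classes, not the real API. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { NO_MATCH_QUERY, TERM_QUERY, AND_QUERY } QueryType;

    typedef struct Query {
        QueryType      type;
        bool           fails_to_match;  /* only used by NO_MATCH_QUERY */
        struct Query **children;        /* only used by AND_QUERY      */
        size_t         num_children;
    } Query;

    /* Hypothetical analogue of QParser_Expand_Leaf(): when every token in
     * a leaf turns out to be a stopword, return a NoMatchQuery flagged as
     * "does not fail to match" so the parent can treat it as droppable. */
    static Query*
    expand_leaf_all_stopwords(void) {
        Query *q = calloc(1, sizeof(Query));
        q->type           = NO_MATCH_QUERY;
        q->fails_to_match = false;   /* overrides the default of true */
        return q;
    }

    /* Hypothetical analogue of the check QParser_Expand() would perform
     * while assembling an ANDQuery: a clause may be dropped only if it is
     * a NoMatchQuery whose fails_to_match flag has been cleared. */
    static bool
    clause_is_droppable(const Query *clause) {
        return clause->type == NO_MATCH_QUERY && !clause->fails_to_match;
    }

    int
    main(void) {
        Query *clause = expand_leaf_all_stopwords();
        printf("droppable: %s\n", clause_is_droppable(clause) ? "yes" : "no");
        free(clause);
        return 0;
    }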

There's still ugliness in that approach, but the ugliness gets buried in
places people don't look very often -- instead of a new top-level class, we
add an obscure attribute to a reasonably obscure class.  Only QParser_Expand()
and QParser_Expand_Leaf() need to know this information.  Even QParser_Parse()
doesn't care, because it makes no difference whether a top-level NoMatchQuery
has "fails to match" or "neither matches nor fails to match" semantics.  A
search for a stopword on its own...

    the

... still returns no documents.

> Another solution might be to prune stop words in a separate pass over 
> the query string.

Unfortunately, I don't think that's feasible.  

QueryParser has two-phase tokenization: the first phase extracts leaves (terms
and phrases) and QueryParser-specific tokens like...

   + - ( ) AND NOT OR 

... but it applies no field-specific processing.  The second tokenization
phase runs once for each field, and uses the Analyzer specific to that field
(which it gets from the Schema).  
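
To make the split concrete, here's a toy sketch of the first phase in C.  The
whitespace-only splitting and the function names are simplifications of my
own, not the real parser:

    /* Toy sketch of phase one: classify whitespace-separated chunks of the
     * query string as QueryParser syntax tokens or opaque leaves, applying
     * no field-specific analysis.  (The real parser also handles quoted
     * phrases, punctuation, etc.) */
    #include <stdio.h>
    #include <string.h>

    static int
    is_syntax_token(const char *tok) {
        const char *syntax[] = { "+", "-", "(", ")", "AND", "NOT", "OR" };
        for (size_t i = 0; i < sizeof(syntax) / sizeof(syntax[0]); i++) {
            if (strcmp(tok, syntax[i]) == 0) { return 1; }
        }
        return 0;
    }

    int
    main(void) {
        char query[] = "foo AND ( bar OR baz )";
        for (char *tok = strtok(query, " "); tok; tok = strtok(NULL, " ")) {
            printf("%-6s %s\n", is_syntax_token(tok) ? "SYNTAX" : "LEAF", tok);
        }
        /* Phase two would then run each field's Analyzer (pulled from the
         * Schema) over the LEAF tokens, once per field. */
        return 0;
    }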

We can't use any of the field-specific Analyzers during the first phase
because that tokenization is shared across all fields -- and furthermore,
such an Analyzer would have to be specially tuned to extract
QueryParser-specific tokens properly.  We can't run the Analyzer twice
during the second phase because we don't want to, for example, stem
something that has already been stemmed.

Marvin Humphrey

