lucy-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marvin Humphrey <mar...@rectangular.com>
Subject [lucy-dev] Re: Stopwords and AND queries
Date Wed, 15 Dec 2010 00:04:50 GMT
cc to lucy-dev...

On Mon, Dec 13, 2010 at 01:17:52PM -0800, Marvin Humphrey wrote:
> On Mon, Dec 13, 2010 at 06:45:08PM +0100, Nick Wellnhofer wrote:
> > 
> > If I run the attached script, I don't get any results. It seems that 
> > stopwords aren't removed from the AND query.
> 
> Thanks for the report.  I've distilled the test down to the sample below, which
> produces an ANDQuery where one clause is a NoMatchQuery.
> 
> The flaw resides somewhere in the interplay of QParser_Expand() and
> QParser_Expand_Leaf().  The stopwords *are* being removed, so that expanding
> the "foo" LeafQuery produces an ANDQuery with no clauses.  That's being
> translated to a NoMatchQuery here, within QParser_Expand_Leaf():
> 
>     if (VA_Get_Size(queries) == 0) {
>         retval = (Query*)NoMatchQuery_new();
>     }
> 
> Somehow, we have to get Expand_Leaf() to differentiate between a clause that
> can't match and a clause should be removed, and to communicate that
> information back to Expand() successfully.  Probably we want to use NULL to
> symbolize a clause that should be removed, but that causes test failures.
> 
> More as I keep digging...

We definitely need Expand() and Expand_Leaf() to distinguish between these two
types of results:

    * A query which will fail to match.
    * A query which neither matches nor fails to match.

Use Case 1: say that we have a QueryParser with no default fields, and with
"heed_colons" enabled.  This query should fail to match...

    field1:foo AND bar

... because the required 'bar' term will have no fields.

Use Case 2: say that we have a QueryParser with a single "content" field,
which has an Analyzer which includes an English stoplist.  At present, the
following query string will fail to match, because of the bug that Nick has
uncovered:

    the AND smiths

The problem is that the term 'the' is a required term thanks to the AND
operator, but since no fields have an Analyzer that produces tokens when fed
the string 'the', the current implementation of Expand_Leaf() returns a
NoMatchQuery.  We need to change that logic so that the return value from
Expand_Leaf() communicates to Expand() "this clause neither matches nor fails
to match."

FWIW, Nick's bug in Use Case 2 arose because of a change which fixed Use 
Case 1. :|

    ------------------------------------------------------------------------
    r6253 | creamyg | 2010-07-31 22:06:50 -0700 (Sat, 31 Jul 2010) | 15 lines

    Fix a bug in QParser_Expand_Leaf().  It's not allowed to return NULL, but
    it was under certain circumstances.  First, if there are no default fields
    (either because an empty array was supplied as the "fields" param to
    QParser_new, or because the Schema has no indexed fields).  Second, if the
    Analyzer returns no tokens after parsing the LeafQuery's text -- e.g. if a
    Stopalizer removes all stopwords, leaving no tokens.  

    This change may reveal latent bugs in people's AND-joined queries.  Where
    a search term was being ignored before, now it will be included as a
    NoMatchQuery, causing AND-ed clauses to return no documents (correctly)
    instead of returning documents which should have been excluded by the
    improperly-ignored required term.

    (Replay of r6089.)

It seems to me that the best way to have Expand_Leaf() communicate to Expand()
that a clause is optional is to return NULL.  That's a little confusing,
because when compiling a Query to a Matcher, a NULL Matcher is synonymous with
"fails to match".  But the only other way I can think of to deal with it would
be to introduce a new class, e.g. "VoidQuery", to carry that information.

Expand() will also have to change so that it can return NULL.  Expand() is
recursive, calling Expand() on child queries.  Because it can't know whether
it is operating on a top-level query or a subquery, it can't know whether it
should translate a NULL returned by Expand_Leaf() into a NoMatchQuery or not.

QParser_Parse(), on the other hand, *does* know that it's operating on a
top-level query.  It can check the return value of Expand() and translate
NULLs to NoMatchQueries.  Therefore, the changes under consideration only
affect the expert APIs (Expand and Expand_Leaf) and not the primary API,
Parse().

The last tricky bit is to changing the internals of Expand_Leaf() so that it
returns NULL and NoMatchQuery as appropriate.  I think we need to use a
boolean to track the special case of Analyzers which didn't output any tokens.
If that boolean is true and there are no concrete child Queries, we return
NULL.

Marvin Humphrey


Mime
View raw message