lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mindaugas Žakšauskas <>
Subject Re: QueryParser/StopAnalyzer question
Date Mon, 23 May 2011 14:36:48 GMT
Hi Erick,

I think answer to this question depends which hat you put on.

If you put search engine hat (or do similar things in, i.e. Google),
the results will be the same as what Lucene does at the moment. And
that's fair enough - getting more results in search engine world is
almost always better than getting less. Even if a bunch of slightly
irrelevant results is returned, nobody cares.

But if you put a database hat, the world view suddenly changes. I am
sure there are plenty of people who use Lucene in situations where
they need exact matches and any excess results are not desirable.

The root of the evil here is coming from the fact that stopwords are
not indexed and reasonable defaults have to be assumed in different
situations. Thinking of this, to return all data for stopword-only
query would probably be least expected and I don't disagree on your
argument about the mixed case, too.

This probably leaves me with a single option which is not to use
stopwords at all, allowing me to get the best of the both worlds. Does
anyone have any experience on how much of increased index size
(roughly) can I expect?


On Mon, May 23, 2011 at 3:13 PM, Erick Erickson <> wrote:
> Hmmm, somehow I missed this days ago....
> Anyway, the Lucene query parsing process isn't quite Boolean logic.
> I encourage you to think in terms of "required", "optional", and
> "prohibited".
> Both queries are equivalent, to see this try attaching &debugQuery=on
> to your URL and look at the "parsed query" in the debug info....
> Anyway, to your qestion.
> +foo:bar +baz:"there is"
> reads that "bar" must appear in the field "foo". So far so good.
> But it's also required that baz contain the empty clause, which
> is different than saying baz must be empty. One can argue that
> any field contains, by definition, nothing.
> But imagine the impact of what you're requesting. If all stop words
> get removed, then no query would ever match yours. Which
> would be very counter-intuitive IMO. Your users have no clue
> that you've removed stopwords, so they'll sit there saying "Look, I
> KNOW that "bar" was in foo and I KNOW that "there is" was in
> baz, why the heck didn't this cursed system find my doc?
> Anyway, I don't think you really want this behavior in the
> stopword removal case. If you can post some use-cases where this
> would be desirable, maybe we can noodle about a solution....
> Best
> Erick
> 2011/5/23 Mindaugas Žakšauskas <>:
>> Not much luck so far :(
>> Just in case if anyone wants to earn some virtual dosh, I have added
>> some 50 bonus points to this question on StackOverflow:
>> I also promise to post a solution here if anything satisfactory turns up.
>> m.
>> 2011/5/17 Mindaugas Žakšauskas <>:
>>> Hi,
>>> Let's say we have an index having few documents indexed using
>>> StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:
>>> 1) foo:bar
>>> 2) baz:"there is"
>>> Let's assume that the first query yields some results because there
>>> are documents matching that query.
>>> The second query contains two stopwords ("there" and "is") and yields
>>> 0 results. The reason for this is because when baz:"there is" is
>>> parsed, it ends up as a void query as both "there" and "is" are
>>> stopwords (technically speaking, this is converted to an empty
>>> BooleanQuery having no clauses). So far so good.
>>> However, any of the following combined queries
>>> +foo:bar +baz:"there is"
>>> foo:bar AND baz:"there is"
>>> behave exactly the same way as query +foo:bar, that is, brings back
>>> some results. The second AND part which is supposed to yield no
>>> results is completely ignored.
>>> One might argue that when ANDing both conditions have to be met, that
>>> is, documents having foo=bar and baz being empty have to be retrieved,
>>> as when issued seperately, baz:"there is" yields 0 results.
>>> It seem contradictory as an atomic query component has different
>>> impact on the overall query depending on the context. Is there any
>>> logical explanation for this? Can this be addressed in any way,
>>> preferably without writing own QueryAnalyzer?
>>> If this makes any difference, observed behaviour happens under Lucene v3.0.2.
>>> Regards,
>>> Mindaugas
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message