lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erick Erickson <erickerick...@gmail.com>
Subject Re: QueryParser/StopAnalyzer question
Date Sun, 29 May 2011 02:24:52 GMT
Sorry, I was at Lucene Revolution, and out of circulation for a while.

I agree with your point that "it depends" (what correct behavior is).
Seems like a lot of Solr/Lucene has that answer...

But as to stop words, the trend lately is to NOT remove them
anyway. The whole stop word issue comes from a long time
ago when disk space was much more dear.

As to how much it increases the index size, I don't have
a clue off the top of my head...

Best
Erick

2011/5/23 Mindaugas Žakšauskas <mindas@gmail.com>:
> Hi Erick,
>
> I think answer to this question depends which hat you put on.
>
> If you put search engine hat (or do similar things in, i.e. Google),
> the results will be the same as what Lucene does at the moment. And
> that's fair enough - getting more results in search engine world is
> almost always better than getting less. Even if a bunch of slightly
> irrelevant results is returned, nobody cares.
>
> But if you put a database hat, the world view suddenly changes. I am
> sure there are plenty of people who use Lucene in situations where
> they need exact matches and any excess results are not desirable.
>
> The root of the evil here is coming from the fact that stopwords are
> not indexed and reasonable defaults have to be assumed in different
> situations. Thinking of this, to return all data for stopword-only
> query would probably be least expected and I don't disagree on your
> argument about the mixed case, too.
>
> This probably leaves me with a single option which is not to use
> stopwords at all, allowing me to get the best of the both worlds. Does
> anyone have any experience on how much of increased index size
> (roughly) can I expect?
>
> Regards,
> Mindaugas
>
> On Mon, May 23, 2011 at 3:13 PM, Erick Erickson <erickerickson@gmail.com> wrote:
>> Hmmm, somehow I missed this days ago....
>>
>> Anyway, the Lucene query parsing process isn't quite Boolean logic.
>> I encourage you to think in terms of "required", "optional", and
>> "prohibited".
>>
>> Both queries are equivalent, to see this try attaching &debugQuery=on
>> to your URL and look at the "parsed query" in the debug info....
>>
>> Anyway, to your qestion.
>> +foo:bar +baz:"there is"
>>
>> reads that "bar" must appear in the field "foo". So far so good.
>> But it's also required that baz contain the empty clause, which
>> is different than saying baz must be empty. One can argue that
>> any field contains, by definition, nothing.
>>
>> But imagine the impact of what you're requesting. If all stop words
>> get removed, then no query would ever match yours. Which
>> would be very counter-intuitive IMO. Your users have no clue
>> that you've removed stopwords, so they'll sit there saying "Look, I
>> KNOW that "bar" was in foo and I KNOW that "there is" was in
>> baz, why the heck didn't this cursed system find my doc?
>>
>> Anyway, I don't think you really want this behavior in the
>> stopword removal case. If you can post some use-cases where this
>> would be desirable, maybe we can noodle about a solution....
>>
>> Best
>> Erick
>>
>>
>> 2011/5/23 Mindaugas Žakšauskas <mindas@gmail.com>:
>>> Not much luck so far :(
>>>
>>> Just in case if anyone wants to earn some virtual dosh, I have added
>>> some 50 bonus points to this question on StackOverflow:
>>>
>>> http://stackoverflow.com/questions/6H044061/lucene-query-parsing-behaviour-joining-query-parts-with-and
>>>
>>> I also promise to post a solution here if anything satisfactory turns up.
>>>
>>> m.
>>>
>>> 2011/5/17 Mindaugas Žakšauskas <mindas@gmail.com>:
>>>> Hi,
>>>>
>>>> Let's say we have an index having few documents indexed using
>>>> StopAnalyzer.ENGLISH_STOP_WORDS_SET. The user issues two queries:
>>>> 1) foo:bar
>>>> 2) baz:"there is"
>>>>
>>>> Let's assume that the first query yields some results because there
>>>> are documents matching that query.
>>>>
>>>> The second query contains two stopwords ("there" and "is") and yields
>>>> 0 results. The reason for this is because when baz:"there is" is
>>>> parsed, it ends up as a void query as both "there" and "is" are
>>>> stopwords (technically speaking, this is converted to an empty
>>>> BooleanQuery having no clauses). So far so good.
>>>>
>>>> However, any of the following combined queries
>>>>
>>>> +foo:bar +baz:"there is"
>>>> foo:bar AND baz:"there is"
>>>>
>>>> behave exactly the same way as query +foo:bar, that is, brings back
>>>> some results. The second AND part which is supposed to yield no
>>>> results is completely ignored.
>>>>
>>>> One might argue that when ANDing both conditions have to be met, that
>>>> is, documents having foo=bar and baz being empty have to be retrieved,
>>>> as when issued seperately, baz:"there is" yields 0 results.
>>>>
>>>> It seem contradictory as an atomic query component has different
>>>> impact on the overall query depending on the context. Is there any
>>>> logical explanation for this? Can this be addressed in any way,
>>>> preferably without writing own QueryAnalyzer?
>>>>
>>>> If this makes any difference, observed behaviour happens under Lucene v3.0.2.
>>>>
>>>> Regards,
>>>> Mindaugas
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message