lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: Strange change to query parser behaviour in recent versions
Date Sat, 20 Aug 2011 07:34:50 GMT
On Fri, Aug 19, 2011 at 11:05 AM, Chris Hostetter
<hossman_lucene@fucit.org> wrote:
>
> See LUCENE-2458 for the backstory.
>
> the argument was that while phrase queries were historically generated by
> the query parser when a single (whitespace-delimited) "chunk" of query
> parser input produced multiple tokens, that logic didn't make sense in CJK
> type languages where whitespace is not semantically meaningful for
> separating "terms"
>
> As i understand it: both [[ 限 定 ]] and [[ 限定 ]] should be treated
> equivalently in asian languages, so they *both* become BooleanQueries for
> those two words (using the default query operator)

It's odd.  I thought that automatically generating phrase queries was
actually useful specifically *for* CJK languages, as it essentially
allows searching for a "word" as if it is really being tokenised as
one (which of course it isn't.  Not with StandardTokenizer, anyway.)
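
To make the difference concrete, here is a toy sketch in plain Java.
It does not use Lucene at all: tokenize() is a hypothetical stand-in
that emits one token per ideograph (roughly what StandardTokenizer does
for this input), and parse() is an invented helper whose boolean flag is
only modelled on QueryParser's autoGeneratePhraseQueries setting.  The
output is a query-like string, not a real Query object.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseQueryDemo {
    // Hypothetical tokenizer: one token per ideograph, runs of other
    // letters/digits kept together, everything else is a separator.
    static List<String> tokenize(String chunk) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < chunk.length(); ) {
            int cp = chunk.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp);
            } else {
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    // Split the input on whitespace, analyse each chunk, and decide what a
    // chunk that yields several tokens turns into:
    //   flag on  -> a quoted phrase (the "old" behaviour)
    //   flag off -> separate clauses joined by the default operator
    static String parse(String input, boolean autoGeneratePhraseQueries) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : input.trim().split("\\s+")) {
            List<String> tokens = tokenize(chunk);
            if (tokens.isEmpty()) {
                continue;
            }
            if (tokens.size() > 1 && autoGeneratePhraseQueries) {
                clauses.add("\"" + String.join(" ", tokens) + "\"");
            } else {
                clauses.addAll(tokens);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String joined = "\u9650\u5b9a";   // the two-character query from this thread
        String spaced = "\u9650 \u5b9a";  // the same two characters, space-separated
        System.out.println(parse(joined, true));   // one phrase clause
        System.out.println(parse(joined, false));  // two separate clauses
        System.out.println(parse(spaced, true));   // two separate clauses
        System.out.println(parse(spaced, false));  // two separate clauses
    }
}
```

With the flag on, only the unspaced input becomes a phrase; with it off,
both inputs come out as the same pair of clauses, which is the
equivalence described in the quoted text above.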

Since the Javadoc said it wasn't good for all languages, I assumed the
problem had to be something more obscure than CJK.  But now I'll have
to ask our users in those countries whether the old behaviour is
actually inconvenient for them.  If it is, we'll probably just adopt
the new behaviour and remove our hack.

> I don't necessarily agree with the fact that the default was changed, but
> (unless i'm completely missing something) it was changed in a way that
> should be back compatible if you use a consistent Version param on your
> QueryParser instance.

This is true.  QueryParser itself is fine (default aside); it's
StandardQueryParser that currently offers no choice, and that is where
I first encountered this surprising behaviour.  In fact, the reason I
discovered it was that we had unit tests parsing Japanese queries
and confirming that they did come back as phrases.  :)

As an aside, Google's behaviour seems to follow the "old" way.  For
instance, [[ 限定 ]] returns 640,000,000 hits and [[ 限 定 ]] returns
772,000,000.  (Interestingly, [[ "限定" ]] returns 643,000,000 hits.
Slightly more than you might expect.)

TX
