lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate phrasequeries based on term count
Date Sun, 23 May 2010 16:42:37 GMT
These comments lead me to believe you don't understand the issue.

Do you understand that *ALL* CJK queries are made into phrase queries,
regardless of tokenizer?!!?!?!
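
To make the point concrete, here is a minimal, library-free sketch of the
heuristic described in the issue below. It is not the actual QueryParser code:
the class name, the analyze() stand-in (one token per Han character, otherwise
split on non-alphanumerics) and the phraseOnTermCount flag are all assumptions
made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; none of these names exist in Lucene.
class PhraseHeuristicSketch {

    // Stand-in analyzer (an assumption): each Han character becomes its own
    // token; other text is lowercased and split on non-alphanumeric characters.
    static List<String> analyze(String chunk) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < chunk.length(); ) {
            int cp = chunk.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // A Han character or a separator ends any pending word.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    // Splits the query on whitespace first, then analyzes each chunk.
    // phraseOnTermCount = true mimics the heuristic: more than one term from a
    // chunk yields a phrase. With false, only a chunk wrapped in double quotes
    // does. Quoted phrases containing spaces are out of scope for this sketch.
    static String parse(String query, boolean phraseOnTermCount) {
        List<String> clauses = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            boolean quoted = chunk.length() >= 2
                    && chunk.startsWith("\"") && chunk.endsWith("\"");
            List<String> terms = analyze(chunk);
            if (terms.size() > 1 && (quoted || phraseOnTermCount)) {
                clauses.add("\"" + String.join(" ", terms) + "\"");
            } else {
                clauses.addAll(terms);
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String cjk = "\u4e2d\u6587"; // two Han characters, no whitespace
        System.out.println(parse(cjk, true));             // phrase, no quotes typed
        System.out.println(parse(cjk, false));            // two independent terms
        System.out.println(parse("wi-fi router", true));  // phrase from the dash split
        System.out.println(parse("wi-fi router", false)); // three independent terms
    }
}
```

Any tokenizer that emits more than one token for a whitespace-free chunk takes
the same branch, which is why the choice of tokenizer does not help here.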

On Sun, May 23, 2010 at 12:38 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> Same here, as already noted in the issue.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> From: Shai Erera [mailto:serera@gmail.com]
> Sent: Sunday, May 23, 2010 6:34 PM
>
> To: dev@lucene.apache.org
> Subject: Re: [jira] Commented: (LUCENE-2458) queryparser shouldn't generate
> phrasequeries based on term count
>
>
>
> Robert - is the effect on scoring also on English and other European
> languages? Or is it mostly for ngram-based languages, and especially CJK?
>
> I want to stress that not all ngram-based languages are affected by this
> behavior, especially those for which we do ngram just because of a lack of
> a good tokenizer.
>
> That's why I'm not sure the default should be changed and I'm all for a
> getter/setter. If however it turns out the default MUST be changed, then I
> support the Version + getter/setter approach.
>
> Shai
>
> On Sun, May 23, 2010 at 6:00 PM, Uwe Schindler (JIRA) <jira@apache.org>
> wrote:
>
>    [
> https://issues.apache.org/jira/browse/LUCENE-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870410#action_12870410
> ]
>
> Uwe Schindler commented on LUCENE-2458:
> ---------------------------------------
>
> Hi Robert,
>
> I also agree with Mark (as you know). We can have both:
> - Version for a good default (3.1 will get the new non-phrase-query
> behavior)
> - A separate getter/setter for this option
> (set/getCreatePhraseQueryOnConcatenatedTerms or whatever)
>
> This would give you the best of both worlds.
>
>> queryparser shouldn't generate phrasequeries based on term count
>> ----------------------------------------------------------------
>>
>>                 Key: LUCENE-2458
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-2458
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: QueryParser
>>            Reporter: Robert Muir
>>            Assignee: Robert Muir
>>            Priority: Blocker
>>             Fix For: 3.1, 4.0
>>
>>         Attachments: LUCENE-2458.patch, LUCENE-2458.patch
>>
>>
>> The current method in the queryparser to generate phrasequeries is wrong:
>> The Query Syntax documentation
>> (http://lucene.apache.org/java/3_0_1/queryparsersyntax.html) states:
>> {noformat}
>> A Phrase is a group of words surrounded by double quotes such as "hello
>> dolly".
>> {noformat}
>> But as we know, this isn't actually true.
>> Instead the terms are first divided on whitespace, then the analyzer term
>> count is used as some sort of "heuristic" to determine if it's a phrase query
>> or not.
>> This assumption is a disaster for languages that don't use whitespace
>> separation: CJK, compounding European languages like German, Finnish, etc.
>> It also makes it difficult for people to use n-gram analysis techniques. In
>> these cases you get bad relevance (MAP improves nearly *10x* if you use a
>> PositionFilter at query-time to "turn this off" for Chinese).
>> Even for English, this undocumented behavior is bad. Perhaps in some cases
>> it's being abused as some heuristic to "second guess" the tokenizer and piece
>> back things it shouldn't have split, but for large collections, doing things
>> like generating phrasequeries because StandardTokenizer split a compound on
>> a dash can cause serious performance problems. Instead people should analyze
>> their text with the appropriate methods, and QueryParser should only
>> generate phrase queries when the syntax asks for one.
>> The PositionFilter in contrib can be seen as a workaround, but it's pretty
>> obscure and people are not familiar with it. The result is we have bad
>> out-of-box behavior for many languages, and bad performance for others on
>> some inputs.
>> I propose instead that we change the grammar to actually look for double
>> quotes to determine when to generate a phrase query, consistent with the
>> documentation.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
>
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



-- 
Robert Muir
rcmuir@gmail.com
