lucene-dev mailing list archives

From "Tom Burton-West (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token
Date Wed, 07 Nov 2012 19:02:14 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492591#comment-13492591
] 

Tom Burton-West commented on SOLR-3589:
---------------------------------------

Hi Robert,

I just put the backport to 3.6 up on our test server and pointed it to one of our production
shards. The improvement for Chinese queries is dramatic, especially for longer queries
such as the TREC 5 queries (see examples below).

When you have time, please look over the backport of the patch. I think it is fine, but I
would appreciate your looking it over. My understanding of your patch is that it only affects
a small portion of the edismax logic, but I don't understand the edismax parser well enough
to be sure there isn't some difference between 3.6 and 4.0 that I didn't account for in the
patch.

Thanks for working on this.   Naomi and I are both very excited about this bug finally being
fixed and want to put the fix into production soon.
---
Example TREC 5 Chinese queries:

<num> Number: CH4
<E-title> The newly discovered oil fields in China.
<C-title> 中国大陆新发现的油田   
40,135 items found for 中国大陆新发现的油田 with the current implementation (due to the dismax bug)
78 items found for 中国大陆新发现的油田 with the patch

<num> Number: CH10
<E-title> Border Trade in Xinjiang
<C-title> 新疆的边境贸易  
20,249 items found for 新疆的边境贸易 with the current implementation (with the bug)
243 items found for 新疆的边境贸易 with the patch.
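
For anyone following along, here is a rough sketch of what I understand is happening, using
the "fire-fly" example from the issue description below. This is paraphrased rather than
actual debugQuery output, and the field name "text" is just for illustration:

    q=fire-fly&defType=edismax&qf=text&mm=100%

    without the patch:  (text:fire text:fly)        <- both clauses optional, mm ignored
    with the patch:     ((text:fire text:fly)~2)    <- mm=100% applied, both tokens required

The long Chinese queries above hit the same code path: each query is split into many tokens
that all end up optional, which is why the unpatched result counts are so much larger.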

                
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3589
>                 URL: https://issues.apache.org/jira/browse/SOLR-3589
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 3.6, 4.0-BETA
>            Reporter: Tom Burton-West
>            Assignee: Robert Muir
>         Attachments: SOLR-3589-3.6.PATCH, SOLR-3589.patch, SOLR-3589.patch, SOLR-3589.patch,
> SOLR-3589.patch, SOLR-3589.patch, SOLR-3589_test.patch, testSolr3589.xml.gz, testSolr3589.xml.gz
>
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer
> chain (e.g. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of an
> OR query, "fire OR fly", is produced.
> This is particularly a problem for languages that do not use white space to separate
> words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


