lucene-dev mailing list archives

From "Jack Krupansky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token
Date Thu, 16 Aug 2012 16:41:38 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436082#comment-13436082 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

Be careful not to confuse dismax and edismax. They are two different query parsers, with different
goals.

One of edismax's goals was to support "fielded queries" (e.g., "title:abc AND date:123") and
the full Lucene query syntax. No typical analyzer will be able to tell you that title and
date are field names.
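To make the distinction concrete, here is a minimal sketch (plain Python, not Solr/Lucene code) of why field-name recognition has to happen at the parser level, on whitespace-delimited chunks, before any analyzer runs. The classification rules here are deliberately simplified stand-ins for the real Lucene syntax.

```python
def split_fielded_query(q):
    """Split a query string on whitespace and classify each chunk.

    An analyzer only sees a character stream for one field; it has no way
    to know that "title" in "title:abc" names a field. The parser, working
    on whitespace-delimited chunks, can make that call.
    """
    parts = []
    for chunk in q.split():
        if chunk in ("AND", "OR", "NOT"):
            parts.append(("operator", chunk))
        elif ":" in chunk:
            field, value = chunk.split(":", 1)
            parts.append(("fielded", field, value))
        else:
            parts.append(("term", chunk))
    return parts

print(split_fielded_query("title:abc AND date:123"))
# [('fielded', 'title', 'abc'), ('operator', 'AND'), ('fielded', 'date', '123')]
```

Once a language has no whitespace between words, this first split never happens, which is exactly where the whitespace-centric heritage bites.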

Not "English-centric", but European/Latin-centric for sure. The edismax and classic Lucene
query parsers share that heritage, based on whitespace, but the dismax query parser doesn't
"suffer" from that same need to parse field names and operators.

There is no question that better query parser support is needed for non-European/Latin languages,
but that requires careful, high-level, overall design, which is a tall order for a fast-paced
open source community where features tend to be looked at in isolation.

One clarification...

bq. assumes that a term is a whitespace-delimited string

Yes and no. We need to be careful about distinguishing a "source term" - what the parser recognizes,
which is whitespace delimited, from "analyzed terms" which are recognized and output by the
field type analyzers. There is no requirement that the output terms be whitespace-delimited
or that the input to an analyzer be whitespace-delimited. So, the theory has been that even
a whitespace-centric complex-structure query parser can also handle, for example, Chinese
text. Obviously that hasn't worked out as cleanly as desired and more work is needed.
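The source-term/analyzed-term distinction, and how mm interacts with it, can be sketched in a few lines (a toy illustration, not edismax internals; hyphen splitting stands in here for WordDelimiterFilter or a CJK tokenizer):

```python
import math

def analyze(source_term):
    """Toy analyzer: one source term may expand into several analyzed terms."""
    return source_term.split("-")          # "fire-fly" -> ["fire", "fly"]

def mm_over_source_terms(query, mm_percent):
    """Intended behavior: compute minimum-should-match over *source*
    (whitespace-delimited) terms, keeping each source term's expansion
    grouped as a single clause."""
    grouped = [analyze(t) for t in query.split()]
    min_match = math.ceil(len(grouped) * mm_percent / 100)
    return grouped, min_match

grouped, min_match = mm_over_source_terms("fire-fly lantern", 100)
# grouped == [['fire', 'fly'], ['lantern']], min_match == 2: the expansion
# of "fire-fly" stays one clause, so mm=100% still means "both source
# terms must match".
```

The bug described in SOLR-3589 is, in these terms, the expansion being flattened into sibling clauses without mm being re-applied, so "fire-fly" at mm=100% ends up as the equivalent of (fire OR fly).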

                
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3589
>                 URL: https://issues.apache.org/jira/browse/SOLR-3589
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 3.6
>            Reporter: Tom Burton-West
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer
chain (e.g., "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of
an OR query for "fire OR fly" is produced.
> This is particularly a problem for languages that do not use white space to separate
words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

