lucene-dev mailing list archives

From "Jack Krupansky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3589) Edismax parser does not honor mm parameter if analyzer splits a token
Date Thu, 16 Aug 2012 15:47:38 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436045#comment-13436045 ]

Jack Krupansky commented on SOLR-3589:
--------------------------------------

bq.  If I use a smart Chinese tokenizer to split up a Chinese sentence into words, why can't
the query parser treat those words exactly the same way it treats words from an English sentence?

Indexing of whole documents can in fact treat text as if it were words from an English sentence,
and split tokens do in fact behave as such in that context, but a query is not an English
sentence or a sentence in any natural language. Rather, a query is a structured expression composed
of terms and operators, typically separated by whitespace or special operators such as parentheses.
Portions of queries may look like natural language phrases or even whole sentences, but in
reality they are sequences of terms and operators.

In addition to being parsed according to the syntax of queries, as opposed to natural language
processing or the raw token stream processing of an indexer, each of the query terms must
be "analyzed" before the final form of the term can be generated into a Lucene Query structure.
That analysis is performed separately from the "parsing" of the structured user query expression.
That means that the processing of sub-terms that result from analysis is handled at a different
level than source-level query terms that happen to "look" like English words. In other words,
the sub-terms are processed by the "query generator" while the source terms are processed
by the "query parser". We loosely refer to the combination of (user) query parsing and (Lucene)
query generation as "the query parser", but it is important to distinguish (user query) "parsing"
from (Lucene Query) "generation".
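
The distinction between "parsing" and "generation" can be illustrated with a minimal
sketch. This is a toy model in Python, not Solr or Lucene code: the functions parse,
analyze, and generate are hypothetical names, and the analyzer is assumed to do nothing
more than lowercase a term and split it on hyphens.

```python
def parse(query):
    """User query parsing: split the query on whitespace into source terms."""
    return query.split()


def analyze(term):
    """Toy analyzer (an assumption): lowercase the term and split it on hyphens."""
    return [t for t in term.lower().split("-") if t]


def generate(source_terms):
    """Query generation: one clause per source term.

    Sub-terms produced by analysis stay grouped inside the clause of the
    source term they came from; they are never seen by parse().
    """
    return [analyze(term) for term in source_terms]


clauses = generate(parse("Red fire-fly"))
print(clauses)  # [['red'], ['fire', 'fly']]
```

In this sketch the parser sees two source terms, while the generator ends up with three
sub-terms grouped into two clauses, which is why the two levels can behave differently.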

The query parser does its best to handle sub-terms reasonably, but expecting that they will
magically be handled in exactly the same way as source terms is somewhat impractical. That doesn't
mean that there can't be improvement, but simply that a dose of realism is needed when considering
the potential, challenges, and limits of query parsing/processing/generation.

                
> Edismax parser does not honor mm parameter if analyzer splits a token
> ---------------------------------------------------------------------
>
>                 Key: SOLR-3589
>                 URL: https://issues.apache.org/jira/browse/SOLR-3589
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>    Affects Versions: 3.6
>            Reporter: Tom Burton-West
>
> With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer
chain (e.g. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of
an OR query, "fire OR fly", is produced.
> This is particularly a problem for languages that do not use white space to separate
words, such as Chinese or Japanese.
> See these messages for more discussion:
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-hypenated-words-WDF-splitting-etc-tc3991911.html
> http://lucene.472066.n3.nabble.com/edismax-parser-ignores-mm-parameter-when-tokenizer-splits-tokens-i-e-CJK-tc3991438.html
> http://lucene.472066.n3.nabble.com/Why-won-t-dismax-create-multiple-DisjunctionMaxQueries-when-autoGeneratePhraseQueries-is-false-tc3992109.html
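
The behavior reported in the issue description can be mimicked with a small toy model.
This is a sketch under stated assumptions, not the edismax implementation: the matches
function and its hyphen-splitting analyzer are hypothetical, and the model simply assumes
that mm is counted over whitespace-separated source terms while the sub-terms of a split
token are ORed together.

```python
def analyze(term):
    """Toy analyzer (an assumption): lowercase the term and split it on hyphens."""
    return [t for t in term.lower().split("-") if t]


def matches(doc_tokens, query, mm_percent=100):
    """Return True if the document satisfies the query in this toy model.

    mm is applied only to the top-level clauses (one per source term).
    Within a clause, any single sub-term is enough to count as a hit,
    which models the reported "fire OR fly" behavior.
    """
    clauses = [analyze(term) for term in query.split()]
    required = len(clauses) * mm_percent // 100
    hits = sum(1 for subterms in clauses if any(s in doc_tokens for s in subterms))
    return hits >= required


doc = {"fire", "truck"}
print(matches(doc, "fire fly"))  # False: two source terms, both required
print(matches(doc, "fire-fly"))  # True: one source term, sub-terms are ORed
```

With mm at 100%, the whitespace-separated query requires both terms, but the hyphenated
query matches a document containing only "fire", which is the discrepancy the issue
describes.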


