lucene-dev mailing list archives

From Jan Høydahl (Commented) (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SOLR-3085) Fix the dismax/edismax stopwords mm issue
Date Thu, 02 Feb 2012 19:42:53 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199147#comment-13199147 ]

Jan Høydahl commented on SOLR-3085:
-----------------------------------

bq. i have a nagging feeling that there are non-stopword cases that would be indistinguishable
(to the parser) from this type of stopword case, and thus would also trigger this logic undesirably,
but i can't articulate what they might be off the top of my head.

A potentially difficult one is this multi-language example: {{&qf=title_no title_en tags}}.
Each of these fields may have its own stopword list; say title_no has the stopword "men"
(Norwegian for "but") and title_en has the stopword "the". Then we query {{q=the men}}. The user
expectation would be that it returns ENGLISH docs matching "men", since "the" is an English
stopword.

Today we'd get:
{noformat}
+((DisjunctionMaxQuery((title_no:the | tags:the)~0.01) DisjunctionMaxQuery((title_en:men | tags:men)~0.01))~2)
{noformat}
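
To make the shape of that parsed query easier to follow, here is a small Python sketch that only simulates per-field stopword pruning. It is not Solr code: the stopword sets, the assumption that tags has no stopword filtering, and the helper names are mine, chosen to mirror the example.

```python
# Hypothetical per-field stopword lists mirroring the example above.
STOPWORDS = {
    "title_no": {"men"},   # Norwegian stopword
    "title_en": {"the"},   # English stopword
    "tags": set(),         # assumed to have no stopword filtering
}
QF = ["title_no", "title_en", "tags"]


def surviving_fields(term, qf=QF, stopwords=STOPWORDS):
    """Return the qf fields whose (simulated) analysis keeps the term."""
    return [field for field in qf if term not in stopwords[field]]


def dismax_clauses(terms):
    """Build one textual DisMax clause per term, skipping fields that dropped it."""
    return [
        " | ".join(f"{field}:{term}" for field in surviving_fields(term))
        for term in terms
    ]


for clause in dismax_clauses(["the", "men"]):
    print(clause)
```

Each term ends up with a different set of fields, which is why neither clause can be recognised as a plain "stopword clause" by looking at it alone.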

In this case with mm=100% we'd likely get 0 hits, given that "the" is not common in either
title_no or tags. However, the parser cannot know whether the user's real information need
is "the" or "men" - since each is a stopword for a different field.

Now, all DisMax clauses in this example have had at least one stopword pruned, so using the
"mm decrement" strategy would change mm from 2 to 0, which would turn this into an OR query
- and of course return results. This is a compromise, so a better option in this special case
would probably be to use eDisMax's "smart" conditional stopword removal [1], but that requires
a change of fieldType.
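
The "mm decrement" arithmetic for this example could be sketched roughly as below. This is only an illustration of the idea under discussion, not an existing Solr feature; the function name and the rule "one decrement per clause that lost a field" are my assumptions about how such a strategy might count.

```python
def decrement_mm(mm, clause_fields, qf):
    """Subtract one from mm per clause that lost at least one field to stopword pruning."""
    pruned = sum(1 for fields in clause_fields if len(fields) < len(qf))
    return max(0, mm - pruned)


qf = ["title_no", "title_en", "tags"]
# Fields that survived analysis for each term of q=the men.
clause_fields = [
    ["title_no", "tags"],   # "the" was pruned from title_en
    ["title_en", "tags"],   # "men" was pruned from title_no
]
new_mm = decrement_mm(2, clause_fields, qf)
print(new_mm)
```

With both clauses pruned, mm drops from 2 to 0, which is the OR-query compromise described above.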

The "convert to boost query" approach would only work when we have at least one clause without
stopwords, since we cannot query ONLY with bq. Say two of my four query terms {{q=the best
cheap holiday}} are stopwords, and mm=100%. So we remove the two stop clauses from the BooleanQuery,
reduce mm accordingly from 4 (100%) to 2, and add the two stop clauses as BQs. This approach
would also work for mm<100% cases, since we only count mm clauses from the non-stop clauses.
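
A rough Python sketch of that bookkeeping follows. It is a toy model, not a patch: the function name is hypothetical, which two terms count as stop clauses is an assumption made only for illustration, and the percentage is rounded down, which I believe matches how Solr treats percentage mm values.

```python
def split_for_boost(terms, stop_terms, mm_percent):
    """Split terms into main clauses and boost clauses; compute mm over main clauses only."""
    main = [t for t in terms if t not in stop_terms]
    boosts = [t for t in terms if t in stop_terms]
    if not main:
        # A bq alone cannot drive the query, so this approach does not apply.
        raise ValueError("no non-stop clause left")
    mm = len(main) * mm_percent // 100  # rounded down (assumption)
    return main, boosts, mm


terms = "the best cheap holiday".split()
# Treating "the" and "best" as the two stop clauses is an assumption for this sketch.
main, boosts, mm = split_for_boost(terms, {"the", "best"}, 100)
print(main, boosts, mm)
```

Because mm is computed from the non-stop clauses only, a lower setting such as mm=50% would simply give 1 of the 2 remaining clauses.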

----
[1] For the special case of all clauses being stop clauses, eDisMax's existing "smart" conditional
stopword handling could perhaps be another solution? For those unfamiliar with it, you can
specify {{&stopwords=true}} (which is the default) and eDisMax will remove stopwords for
you instead of letting Analysis do it. It requires that you don't have StopFilterFactory in
your Analysis. Now, if ALL query terms are stopwords, eDisMax will not remove them, to support
queries like "Who is the who?". (Q: How does eDisMax pick up which stopword dictionary(ies)
to use?) It's of no use to those removing stopwords in their "index" analysis, though.
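
As a minimal sketch of the conditional behaviour described in [1] (my reading of it, not the actual eDisMax code), stopwords are dropped unless that would leave nothing to search for. The stopword set and function name here are hypothetical.

```python
def remove_stopwords_conditionally(terms, stopwords):
    """Drop stopwords, but keep every term if the query consists only of stopwords."""
    kept = [t for t in terms if t.lower() not in stopwords]
    return kept if kept else list(terms)


# Hypothetical stopword list, for illustration only.
STOPWORDS = {"who", "is", "the"}

print(remove_stopwords_conditionally(["Who", "is", "the", "who"], STOPWORDS))
print(remove_stopwords_conditionally(["the", "best", "holiday"], STOPWORDS))
```

The first query is kept intact because every term is a stopword; the second loses only "the".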
                
> Fix the dismax/edismax stopwords mm issue
> -----------------------------------------
>
>                 Key: SOLR-3085
>                 URL: https://issues.apache.org/jira/browse/SOLR-3085
>             Project: Solr
>          Issue Type: Bug
>          Components: search
>            Reporter: Jan Høydahl
>              Labels: MinimumShouldMatch, dismax, stopwords
>             Fix For: 3.6, 4.0
>
>
> As discussed here http://search-lucene.com/m/Wr7iz1a95jx and here http://search-lucene.com/m/Yne042qEyCq1
> and here http://search-lucene.com/m/RfAp82nSsla DisMax has an issue with stopwords if not
> all fields used in QF have exactly the same stopword lists.
> The typical solution is to not use stopwords, to harmonize stopword lists across all fields
> in your QF, or to relax the MM to a lower percentage. Sometimes these are not acceptable workarounds,
> and we should find a better solution.
