lucene-dev mailing list archives

From "Shawn Heisey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6689) Odd analysis problem with WDF, appears to be triggered by preceding analysis components
Date Thu, 20 Aug 2015 18:52:46 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705543#comment-14705543 ]

Shawn Heisey commented on LUCENE-6689:
--------------------------------------

I can work around the specific queries that caused the problem if I make index and query WDF
analysis exactly the same ... but there's a problem even then.

As a test, I entirely removed the query analysis above and removed the "type" attribute from
the index analysis so it applies to both.  I put this fieldType into Solr 5.2.1 and went to
the analysis screen.
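
For reference, the test fieldType described in the previous paragraph would look roughly like
the following.  This is a sketch reconstructed from that description (the index analyzer from
the issue text with its "type" attribute dropped, and no separate query analyzer); it is
assumed, not copied from the actual test configuration.

{noformat}
    <!-- Assumed reconstruction: one analyzer with no "type" attribute,
         so the same chain applies at both index and query time. -->
    <fieldType name="testType" class="solr.TextField"
      sortMissingLast="true" positionIncrementGap="100">
      <analyzer>
        <tokenizer
          class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
          rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
        <filter class="solr.PatternReplaceFilterFactory"
          pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
          replacement="$2"
        />
        <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1"
          splitOnNumerics="1"
          stemEnglishPossessive="1"
          generateWordParts="1"
          generateNumberParts="1"
          catenateWords="1"
          catenateNumbers="1"
          catenateAll="0"
          preserveOriginal="1"
        />
      </analyzer>
    </fieldType>
{noformat}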

A phrase search for "aaa bbb" when the indexed value was "aaa-bbb: ccc" does not match, because
the positions are wrong.  I believe that it *should* match.  A user would most likely expect
it to match.

> Odd analysis problem with WDF, appears to be triggered by preceding analysis components
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-6689
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6689
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 4.8
>            Reporter: Shawn Heisey
>
> This problem shows up for me in Solr, but I believe the issue is down at the Lucene level,
> so I've opened the issue in the LUCENE project.  We can move it if necessary.
> I've boiled the problem down to this minimum Solr fieldType:
> {noformat}
>     <fieldType name="testType" class="solr.TextField"
>       sortMissingLast="true" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer
>           class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>           rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="1"
>           catenateNumbers="1"
>           catenateAll="0"
>           preserveOriginal="1"
>         />
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer
>           class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
>           rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
>         <filter class="solr.PatternReplaceFilterFactory"
>           pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
>           replacement="$2"
>         />
>         <filter class="solr.WordDelimiterFilterFactory"
>           splitOnCaseChange="1"
>           splitOnNumerics="1"
>           stemEnglishPossessive="1"
>           generateWordParts="1"
>           generateNumberParts="1"
>           catenateWords="0"
>           catenateNumbers="0"
>           catenateAll="0"
>           preserveOriginal="0"
>         />
>       </analyzer>
>     </fieldType>
> {noformat}
> On Solr 4.7, if this type is given the input "aaa-bbb: ccc" then index analysis puts
> aaa at term position 1 and bbb at term position 2.  This seems perfectly reasonable to me.
> In Solr 4.9, both terms end up at position 2.  This causes phrase queries which used to work
> to return zero hits.  The exact text of the phrase query is in the original documents that
> match on 4.7.
> If the custom rbbi (which is included unmodified from the lucene icu analysis source
> code) is not used, then the problem doesn't happen, because the punctuation doesn't make it
> to the PRF.  If the PatternReplaceFilterFactory is not present, then the problem doesn't happen.
> I can work around the problem by setting luceneMatchVersion to 4.7, but I think the behavior
> is a bug, and I would rather not continue to use 4.7 analysis when I upgrade to 5.x, which
> I hope to do soon.
> Whether luceneMatchVersion is LUCENE_47 or LUCENE_4_9, query analysis puts aaa at term
> position 1 and bbb at term position 2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

