lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3821) search slop problem introduced somewhere between Solr 1.4 and Solr 3.5
Date Thu, 23 Feb 2012 22:53:48 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215154#comment-13215154 ]

Robert Muir commented on LUCENE-3821:
-------------------------------------

Here's Naomi's original description (moved from the description section), including the queries.
However, I can't reproduce the problem with just that phrase of the document (I tried, and it passes).

I think it's a more complex bug in SloppyPhraseScorer... and I can reproduce similar behavior
with another test.
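
To make the expected behavior concrete, here is a minimal brute-force sketch in Python of what a sloppy phrase match should accept when a term repeats. It is not SloppyPhraseScorer's algorithm: the function name is hypothetical, and the width-based definition of slop is a simplifying assumption about the intended semantics.

```python
from itertools import product


def sloppy_phrase_matches(doc_terms, query_terms, slop):
    """Brute-force reference check for a sloppy phrase match.

    Assumed semantics (a simplification, not Lucene's actual code): query
    term i is assigned a distinct document position p_i holding the same
    term; the match width is max(p_i - i) - min(p_i - i); the phrase
    matches when some assignment has width <= slop.
    """
    if not query_terms:
        return False

    # Map each term to the list of positions where it occurs in the document.
    positions = {}
    for pos, term in enumerate(doc_terms):
        positions.setdefault(term, []).append(pos)

    candidates = [positions.get(term, []) for term in query_terms]
    for assignment in product(*candidates):
        if len(set(assignment)) != len(assignment):
            continue  # a repeated query term must not reuse one doc position
        shifts = [p - i for i, p in enumerate(assignment)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


doc = "the beatl as musician revolv through the antholog".split()
print(sloppy_phrase_matches(doc, doc, slop=0))  # True
print(sloppy_phrase_matches(doc, doc, slop=1))  # True
```

Under these assumptions an exact match has width 0, so it should still match for any slop greater than 0, which is the property the 3.5 results in the description appear to violate.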

{noformat}
In upgrading from Solr 1.4 to Solr 3.5, the following phrase searches stopped working in dismax:
"The Beatles as musicians : Revolver through the Anthology"
"Color-blindness [print/digital]; its dangers and its detection"
Both of these queries have a repeated word, and have many terms. It's not the number of terms
or the colon surrounded by spaces, because the following phrase search works in Solr 3.5 (and
Solr 1.4):
"International encyclopedia of revolution and protest : 1500 to the present"

With Robert Muir's help, we have narrowed the problem down to slop (proximity in lucene QueryParser,
query slop in dismax). I have included debugQuery details for the Beatles search; I confirmed
the same behavior with the color-blindness search.

Solr 3.5: it fails when (query) slop setting isn't 0.

lucene QueryParser with proximity set to 1 (or anything > 0) : no match
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
final query: all_search:"the beatl as musician revolv through the antholog"~1

lucene QueryParser with proximity set to 0: result!
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
<snip>
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
<snip>

dismax QueryParser with qs=1: no match
ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

dismax QueryParser with qs=0: result!
ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

8.564867 = (MATCH) sum of:
4.2824335 = (MATCH) weight(all_search:"the beatl as musician revolv through the antholog" in 1064395), product of:
<snip>
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 revolv=872 through=81366 the=3531140 antholog=11611)
<snip>

Solr 1.4: it works regardless of slop settings

lucene QueryParser with any proximity value: result!
~0
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query: all_search:"the beatl as musician revolv through the antholog"
~1
URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~1
final query: all_search:"the beatl as musician revolv through the antholog"~1

5.2672544 = fieldWeight(all_search:"the beatl as musician revolv through the antholog" in 3469163), product of:
<snip>
48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
<snip>

dismax QueryParser with any qs: result!
qs=0, ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
qs=0, ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=0&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog")~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01
dismax QueryParser with qs=1: result!
qs=1, ps=0
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=0
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog")~0.01
qs=1, ps=1
URL: qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"&qs=1&ps=1
final query: +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~1)~0.01

7.4490223 = (MATCH) sum of:
3.7245111 = weight(all_search:"the beatl as musician revolv through the antholog"~1 in 3469163), product of:
<snip>
48.157753 = idf(all_search: the=3549637 beatl=392 as=751093 musician=11992 revolv=822 through=88522 the=3549637 antholog=11246)
<snip>

More information:

schema.xml:
<field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
            splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldtype>

solr1.4:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false"
            remove_diacritics="true" remove_modifiers="true" fold="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
            splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
            catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldtype>

And the analysis page shows the same results for Solr 3.5 and 1.4

Solr 3.5:

position     1      2      3      4         5       6        7      8
term text    the    beatl  as     musician  revolv  through  the    antholog
keyword      false  false  false  false     false   false    false  false
startOffset  0      4      12     15        27      36       44     48
endOffset    3      11     14     24        35      43       47     57
type         word   word   word   word      word    word     word   word

Solr 1.4:

term position     1     2      3      4         5       6        7      8
term text         the   beatl  as     musician  revolv  through  the    antholog
term type         word  word   word   word      word    word     word   word
source start,end  0,3   4,11   12,14  15,24     27,35   36,43    44,47  48,57

For debug purposes, we can consider the Solr document as:

<doc>
<str name="all_search">The Beatles as musicians : Revolver through the Anthology</str>
</doc>

I can't attach the full SolrDoc as all_search is indexed, but not stored, and I use SolrJ
to write to the index from java objects ... plus our objects have a zillion fields (I work
in a library with very rich metadata and very exacting solr fields). I have attached the Solr
3.5 schema and solrconfig, but they are big and ugly for the same reasons.

For more details, see the erroneously titled email thread "result present in Solr 1.4 but
missing in Solr 3.5, dismax only" started on 2012-02-22 on solr-user@lucene.apache.org.

    Naomi

{noformat}
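
For anyone who wants to rerun the qs/ps combinations from the description, the request URLs can be generated rather than typed by hand. This is only a sketch in Python: the host, port and path are hypothetical, the function name is made up for illustration, and selecting the parser with defType=dismax plus debugQuery=true is an assumption about how the requests were issued. Only qf, pf, q, qs and ps come from the description.

```python
from itertools import product
from urllib.parse import urlencode

PHRASE = '"The Beatles as musicians : Revolver through the Anthology"'


def dismax_slop_matrix(base_url, field="all_search", slops=(0, 1)):
    """Return {(qs, ps): url} for every combination of query and phrase slop."""
    urls = {}
    for qs, ps in product(slops, repeat=2):
        params = {
            "defType": "dismax",   # assumption: parser chosen per request
            "qf": field,
            "pf": field,
            "q": PHRASE,
            "qs": qs,
            "ps": ps,
            "debugQuery": "true",  # assumption: used to obtain the explain output
        }
        urls[(qs, ps)] = base_url + "?" + urlencode(params)
    return urls


# Hypothetical endpoint; substitute the real Solr select handler.
for (qs, ps), url in dismax_slop_matrix("http://localhost:8983/solr/select").items():
    print(f"qs={qs} ps={ps}: {url}")
```

Running the same four URLs against a 1.4 and a 3.5 instance should show whether only the qs=1 rows lose the document, as reported.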
                
> search slop problem introduced somewhere between Solr 1.4 and Solr 3.5
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>         Attachments: schema.xml, solrconfig-test.xml
>
>
