lucene-dev mailing list archives

From "Michael Ryan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3412) SloppyPhraseScorer returns non-deterministic results for queries with many repeats
Date Fri, 02 Sep 2011 21:47:09 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096344#comment-13096344 ]

Michael Ryan commented on LUCENE-3412:
--------------------------------------

Here's the debugQuery output from when it matched both docs:
{noformat}
<lst name="explain"><str name="2">
1.1890696 = (MATCH) weight(text:"dog dog dog dog"~1 in 1) [DefaultSimilarity], result of:
  1.1890696 = score(doc=1,freq=1.0 = phraseFreq=1.0
), product of:
    0.99999994 = queryWeight, product of:
      2.3781395 = idf(), sum of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
      0.42049676 = queryNorm
    1.1890697 = fieldWeight in 1, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = phraseFreq=1.0
      2.3781395 = idf(), sum of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
      0.5 = fieldNorm(doc=1)
</str><str name="1">
0.8407992 = (MATCH) weight(text:"dog dog dog dog"~1 in 0) [DefaultSimilarity], result of:
  0.8407992 = score(doc=0,freq=0.5 = phraseFreq=0.5
), product of:
    0.99999994 = queryWeight, product of:
      2.3781395 = idf(), sum of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
      0.42049676 = queryNorm
    0.8407993 = fieldWeight in 0, product of:
      0.70710677 = tf(freq=0.5), with freq of:
        0.5 = phraseFreq=0.5
      2.3781395 = idf(), sum of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
      0.5 = fieldNorm(doc=0)
</str></lst>
{noformat}
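
The numbers in the explain output above can be checked by hand. The following is a minimal sketch, assuming the DefaultSimilarity formulas as I understand them for the 3.x line (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq), sloppyFreq = 1 / (distance + 1)). The class and method names are hypothetical and are not Lucene APIs.

```java
public class ExplainArithmetic {
    // Assumed DefaultSimilarity-style idf: 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(int docFreq, int maxDocs) {
        return Math.log(maxDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Assumed DefaultSimilarity-style tf: square root of the (phrase) frequency.
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // Assumed sloppy phrase frequency contribution of one match at a given distance.
    static double sloppyFreq(int distance) {
        return 1.0 / (distance + 1);
    }

    // score = queryWeight * fieldWeight, following the structure of the explain output.
    static double score(double phraseFreq, int terms, int docFreq, int maxDocs, double fieldNorm) {
        double idfSum = terms * idf(docFreq, maxDocs);       // "idf(), sum of"
        double queryNorm = 1.0 / Math.sqrt(idfSum * idfSum); // single clause, boost 1
        double queryWeight = idfSum * queryNorm;
        double fieldWeight = tf(phraseFreq) * idfSum * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // Four query terms, each with docFreq=2 and maxDocs=2, fieldNorm=0.5.
        System.out.println("doc=1, phraseFreq=1.0: " + score(1.0, 4, 2, 2, 0.5));
        System.out.println("doc=0, phraseFreq=0.5: " + score(sloppyFreq(1), 4, 2, 2, 0.5));
    }
}
```

Under these assumptions, phraseFreq=0.5 for doc=0 would be consistent with a single match at distance 1, so the three-word document is being matched as a sloppy phrase rather than rejected.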

Sometimes when it matches both docs I'll get "no matching term" for the second one:
{noformat}
<lst name="explain"><str name="2">
1.1890696 = (MATCH) weight(text:"dog dog dog dog"~1 in 1) [DefaultSimilarity], result of:
  1.1890696 = score(doc=1,freq=1.0 = phraseFreq=1.0
), product of:
    0.99999994 = queryWeight, product of:
      2.3781395 = idf(), sum of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
      0.42049676 = queryNorm
    1.1890697 = fieldWeight in 1, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = phraseFreq=1.0
      2.3781395 = idf(), sum of:
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
        0.5945349 = idf(docFreq=2, maxDocs=2)
      0.5 = fieldNorm(doc=1)
</str><str name="1">
0.0 = (NON-MATCH) no matching term
</str></lst>
{noformat}

> SloppyPhraseScorer returns non-deterministic results for queries with many repeats
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-3412
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3412
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 3.1, 3.2, 3.3, 4.0
>            Reporter: Michael Ryan
>
> Proximity queries with many repeats (four or more, based on my testing) return non-deterministic results. I run the same query multiple times with the same data set and get different results.
> So far I've reproduced this with Solr 1.4.1, 3.1, 3.2, 3.3, and latest 4.0 trunk.
> Steps to reproduce (using the Solr example):
> 1) In solrconfig.xml, set queryResultCache size to 0.
> 2) Add some documents with text "dog dog dog" and "dog dog dog dog". http://localhost:8983/solr/update?stream.body=%3Cadd%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E1%3C/field%3E%3Cfield%20name=%22text%22%3Edog%20dog%20dog%3C/field%3E%3C/doc%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E2%3C/field%3E%3Cfield%20name=%22text%22%3Edog%20dog%20dog%20dog%3C/field%3E%3C/doc%3E%3C/add%3E&commit=true
> 3) Do a "dog dog dog dog"~1 query. http://localhost:8983/solr/select?q=%22dog%20dog%20dog%20dog%22~1
> 4) Repeat step 3 many times.
> Expected results: The document with id 2 should be returned.
> Actual results: The document with id 2 is always returned. The document with id 1 is sometimes returned.
> Different proximity values show the same bug - "dog dog dog dog"~5, "dog dog dog dog"~100, etc. show the same behavior.
> So far I've traced it down to the "repeats" array in SloppyPhraseScorer.initPhrasePositions() - depending on the order of the elements in this array, the document may or may not match. I think the HashSet may be to blame, but I'm not sure - that at least seems to be where the non-determinism is coming from.
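
To illustrate the HashSet suspicion in general terms, here is a minimal sketch that is not taken from the Lucene source. It assumes the set elements do not override hashCode(), so a HashSet orders them by identity hash, which can differ from one JVM run to the next. The Pos class and the method name are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RepeatOrderSketch {
    // Hypothetical stand-in for a per-term position holder; it deliberately
    // does not override hashCode(), so the identity hash is used.
    static final class Pos {
        final int offset;
        Pos(int offset) { this.offset = offset; }
    }

    // Adds `count` elements in offset order, then reports the iteration order.
    static List<Integer> offsetsInIterationOrder(Set<Pos> set, int count) {
        for (int i = 0; i < count; i++) {
            set.add(new Pos(i));
        }
        List<Integer> order = new ArrayList<Integer>();
        for (Pos p : set) {
            order.add(p.offset);
        }
        return order;
    }

    public static void main(String[] args) {
        // HashSet: order depends on identity hash codes and may change between runs.
        System.out.println("HashSet:       " + offsetsInIterationOrder(new HashSet<Pos>(), 4));
        // LinkedHashSet: order is always insertion order.
        System.out.println("LinkedHashSet: " + offsetsInIterationOrder(new LinkedHashSet<Pos>(), 4));
    }
}
```

If the scorer's result depends on the order of an array built from such a set, a set with a defined iteration order (or an explicit sort) would be one way to make the behavior repeatable. Whether that also makes it correct is a separate question.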

