lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Doron Cohen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
Date Tue, 06 Mar 2012 08:08:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223077#comment-13223077
] 

Doron Cohen commented on LUCENE-3821:
-------------------------------------

Thanks Robert, okay, I'll continue with option 2 then.

In addition, perhaps should try harder for a sloppy version of current ExactPhraseScorer,
for both performance and correctness reasons. 

In ExactPhraseScorer, the increment of count[posIndex] is by 1, each time the conditions for
a match (still) holds.  

A sloppy version of this, with N terms and slop=S could increment differently:
{noformat}
1 + N*S        at posIndex
1 + N*S - 1    at posIndex-1 and posIndex+1
1 + N*S - 2 at posIndex-2 and posIndex+3
...
1 + N*S - S at posIndex-S and posIndex+S
{noformat}

For S=0, this falls back to only increment by 1 and only at posIndex, same as the ExactPhraseScorer,
which makes sense.

Also, the success criteria in ExactPhraseScorer, when checking term k, is, before adding up
1 for term k:
* count[posIndex] == k-1
Or, after adding up 1 for term k:
* count[posIndex] == k

In the sloppy version the criteria (after adding up term k) would be:
* count[posIndex] >= k*(1+N*S)-S

Again, for S=0 this falls to the ExactPhraseScorer logic:
* count[posIndex] >= k  

Mike (and all), correctness wise, what do you think?

If you are wondering why the increment at the actual position is (1 + N*S) - it allows to
match only posIndexes where all terms contributed something.

I drew an example with 5 terms and slop=2 and so far it seems correct.

Also tried 2 terms and slop=5, this seems correct as well, just that, when there is a match,
several posIndexes will contribute to the score of the same match. I think this is not too
bad, as for these parameters, same behavior would be in all documents. I would be especially
forgiving for this if we this way get some of the performance benefits of the ExactPhraseScorer.

If we agree on correctness, need to understand how to implement it, and consider the performance
effect. The tricky part is to increment at posIndex-n. Say there are 3 terms in the query
and one of the terms is found at indexes 10, 15, 18. Assume the slope is 2. Since N=3, the
max increment is:
- 1 + N*S = 1 + 3*2 = 7.

So the increments for this term would be (pos, incr):
{noformat}
Pos   Increment
---   ---------
 8  ,  5
 9  ,  6
10  ,  7
11  ,  6
12  ,  5
13  ,  5
14  ,  6
15  ,  7   =  max(7,5)
16  ,  6   =  max(6,5)
17  ,  6   =  max(5,6)
18  ,  7
19  ,  6
20  ,  5
{noformat}

So when we get to posIndex 17, we know that posIndex 15 contributes 5, but we do not know
yet about the contribution of posIndex 18, which is 6, and should be used instead of 5. So
some look-ahead (or some fix-back) is required, which will complicate the code.

If this seems promising, should probably open a new issue for it, just wanted to get some
feedback first.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch,
LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail
on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message