lucene-dev mailing list archives

From "Doron Cohen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
Date Sun, 04 Mar 2012 12:59:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221879#comment-13221879 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I think I understand the cause.

In the current implementation there is an assumption that once we land on the first candidate
document, it is possible to check whether there are repeating pps just by comparing the
in-doc-positions of the terms.

Tricky as it is, while this is true for plain PhrasePositions, it is not true for MultiPhrasePositions.
Assume two MPPs, (a m n) and (b x y), and a first candidate document that starts with "a b".
The in-doc-positions of the two pps would be 0 and 1 respectively (for 'a' and 'b'), so we would
not even detect the fact that there are repetitions, let alone put them in the
same group.
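
To make the failure mode concrete, here is a minimal toy sketch of a position-based check. It is
not Lucene code: the class and method names are hypothetical, and a pp is modelled simply as a
set of terms. The second term set, (b m y), is also an assumption made for illustration, chosen
so that the two sets visibly share a term ('m').

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Toy model only: these are not Lucene classes.
final class PositionCheckSketch {

    // First index in doc at which any of the given terms occurs, or -1 if none does.
    static int firstPosition(List<String> doc, Set<String> terms) {
        for (int i = 0; i < doc.size(); i++) {
            if (terms.contains(doc.get(i))) {
                return i;
            }
        }
        return -1;
    }

    // Position-based check: two pps look like repeats only if both start
    // on the same in-doc position of this document.
    static boolean looksRepeatedByPosition(List<String> doc, Set<String> pp1, Set<String> pp2) {
        int p1 = firstPosition(doc, pp1);
        int p2 = firstPosition(doc, pp2);
        return p1 >= 0 && p1 == p2;
    }

    // In this toy model, two pps can collide somewhere if they share a term.
    static boolean shareTerm(Set<String> pp1, Set<String> pp2) {
        return !Collections.disjoint(pp1, pp2);
    }

    public static void main(String[] args) {
        List<String> doc = List.of("a", "b");

        // Plain pps: the same term 'a' at two query positions starts at the same place.
        Set<String> plainA = Set.of("a");
        System.out.println(looksRepeatedByPosition(doc, plainA, plainA)); // true

        // Multi-term pps (second set is hypothetical): they share 'm',
        // yet start at positions 0 and 1 in this document.
        Set<String> mpp1 = Set.of("a", "m", "n");
        Set<String> mpp2 = Set.of("b", "m", "y");
        System.out.println(shareTerm(mpp1, mpp2));                    // true
        System.out.println(looksRepeatedByPosition(doc, mpp1, mpp2)); // false
    }
}
```

In this model the position check agrees with the term overlap for single-term pps, but misses
the overlap between the two multi-term pps on a document that starts with "a b".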

MPPs conflict with the current patch in an additional manner: it is now assumed that each repetition
can be assigned a repetition group.

So assume these PPs (and query positions): 
0:a 1:b 3:a 4:b 7:c
There are clearly two repetition groups {0:a, 3:a} and {1:b, 4:b}, 
while 7:c is not a repetition.

But assume these PPs (and query positions): 
0:(a b) 1:(b x) 3:a 4:b 7:(c x)
We end up with a single large repetition group:
{0:(a b) 1:(b x) 3:a 4:b 7:(c x)}
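
The two groupings above can be reproduced with a small sketch that ignores document positions
entirely and merges pps whenever their term sets overlap (a union-find over shared terms). This
is only one conceivable approach, not what the patch does; the class and method names are
hypothetical and a pp is modelled as a query position plus a set of terms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only: these are not Lucene classes.
final class RepetitionGroupsSketch {

    // Union-find lookup with path halving.
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    // Returns the query positions of each repetition group (groups of size > 1),
    // ordered by the first pp in each group. terms.get(i) belongs to queryPos[i].
    static List<List<Integer>> repetitionGroups(int[] queryPos, List<Set<String>> terms) {
        int n = queryPos.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;
        }
        // Remember the first pp that used each term; later users are merged with it.
        Map<String, Integer> firstOwner = new HashMap<>();
        for (int i = 0; i < n; i++) {
            for (String term : terms.get(i)) {
                Integer owner = firstOwner.putIfAbsent(term, i);
                if (owner != null) {
                    parent[find(parent, i)] = find(parent, owner);
                }
            }
        }
        // Collect members per root, in pp order.
        Map<Integer, List<Integer>> byRoot = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            byRoot.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(queryPos[i]);
        }
        List<List<Integer>> groups = new ArrayList<>();
        for (List<Integer> group : byRoot.values()) {
            if (group.size() > 1) {
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        int[] queryPos = {0, 1, 3, 4, 7};

        // 0:a 1:b 3:a 4:b 7:c
        System.out.println(repetitionGroups(queryPos, List.of(
            Set.of("a"), Set.of("b"), Set.of("a"), Set.of("b"), Set.of("c"))));
        // prints [[0, 3], [1, 4]]

        // 0:(a b) 1:(b x) 3:a 4:b 7:(c x)
        System.out.println(repetitionGroups(queryPos, List.of(
            Set.of("a", "b"), Set.of("b", "x"), Set.of("a"), Set.of("b"), Set.of("c", "x"))));
        // prints [[0, 1, 3, 4, 7]]
    }
}
```

With multi-term pps the overlap is transitive: 7:(c x) shares nothing with 0:(a b) directly, but
is pulled into the same group through 1:(b x).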

I think if the groups are created correctly at the first candidate document, the scorer logic
would still work, as a collision is decided only when two pps are in the same in-doc-position.
The only impact of MPPs would be a performance cost: since repetition groups are larger, it
would take longer to check if there are repetitions.

Just need to figure out how to detect repetition groups without relying on in-(first-)doc-positions.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop it's not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

