lucene-dev mailing list archives

From "Doron Cohen (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3821) SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
Date Tue, 06 Mar 2012 21:55:57 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13223700#comment-13223700 ]

Doron Cohen commented on LUCENE-3821:
-------------------------------------

I'm afraid it won't solve the problem.

The complexity of SloppyPhraseScorer stems first from the slop.
That part has been handled in the scorer for a long time.

Two additional complications are repeating terms and multi-term phrases.
Each of these, separately, is handled as well.
Their combination, however, is the cause of this discussion.

To prevent two repeating terms from landing on the same document position, we propagate the
smaller of them (smaller in its phrase-position, which takes into account both the doc-position
and the offset of that term in the query).

Without this special treatment, a phrase query "a b a"~2 might match a document "a b", because
both "a"'s (query terms) will land on the same document's "a". This is illegal and is prevented
by such propagation. 
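
To make the "a b a"~2 example concrete, here is a minimal brute-force sketch. It is not
the scorer's code: the class and method names are hypothetical, and it assumes a
simplified match rule (the spread of phrase positions, max minus min, must not exceed
the slop). It only shows why letting two query terms share one document position
produces a false match.

```java
public class RepeatCollisionDemo {

    /**
     * Brute-force check (illustration only): can every query term be assigned a
     * document position holding that term, such that the spread of phrase
     * positions is within the slop?
     */
    static boolean matches(String[] query, String[] doc, int slop, boolean forbidSharedPositions) {
        if (query.length == 0) {
            return true; // nothing to place
        }
        return assign(query, doc, slop, forbidSharedPositions, 0, new int[query.length]);
    }

    private static boolean assign(String[] query, String[] doc, int slop,
                                  boolean forbidShared, int q, int[] chosen) {
        if (q == query.length) {
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            for (int i = 0; i < chosen.length; i++) {
                int phrasePos = chosen[i] - i; // doc-position minus query-offset
                min = Math.min(min, phrasePos);
                max = Math.max(max, phrasePos);
            }
            return max - min <= slop; // assumed, simplified slop rule
        }
        for (int d = 0; d < doc.length; d++) {
            if (!doc[d].equals(query[q])) {
                continue; // this doc position holds a different term
            }
            if (forbidShared && alreadyUsed(chosen, q, d)) {
                continue; // another query term already landed here
            }
            chosen[q] = d;
            if (assign(query, doc, slop, forbidShared, q + 1, chosen)) {
                return true;
            }
        }
        return false;
    }

    private static boolean alreadyUsed(int[] chosen, int upTo, int d) {
        for (int i = 0; i < upTo; i++) {
            if (chosen[i] == d) {
                return true;
            }
        }
        return false;
    }
}
```

With sharing allowed, both query "a"s land on doc position 0 (phrase positions 0, 0 and -2,
a spread of 2), so "a b" wrongly matches. Forbidding shared positions rejects it, while
the document "a b a" still matches.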

But when one of the repeating terms is a multi-term, it is not possible to know which of the
repeating terms to propagate. This is the unsolved bug.

Now, back to the current ExactPhraseScorer.
It does not have this problem with repeating terms.
That is not because of its different algorithm, but because of its different scenario:
exact phrase scoring simply cannot run into it.
In exact phrase scoring, a match is declared only when all PPs are in the same phrase position.
Recall that phrase position = doc-position - query-offset. It follows that when two PPs
with different query offsets are in the same phrase-position, their doc-positions cannot be
the same, and therefore no special handling is needed for repeating terms in exact phrase
scorers.
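
The arithmetic can be sketched as follows. This is an illustration under the definition
given above (phrase position = doc-position - query-offset), with hypothetical class and
method names; it is not the ExactPhraseScorer implementation.

```java
public class PhrasePositionDemo {

    /** Phrase position as defined above: doc-position minus query-offset. */
    static int phrasePosition(int docPosition, int queryOffset) {
        return docPosition - queryOffset;
    }

    /**
     * Exact match (illustration only): every term must sit at the same phrase position.
     * docPositions[i] is the doc position chosen for the query term at queryOffsets[i].
     */
    static boolean isExactMatch(int[] docPositions, int[] queryOffsets) {
        if (docPositions.length != queryOffsets.length) {
            throw new IllegalArgumentException("one doc position per query term is required");
        }
        for (int i = 1; i < docPositions.length; i++) {
            if (phrasePosition(docPositions[i], queryOffsets[i])
                    != phrasePosition(docPositions[0], queryOffsets[0])) {
                return false;
            }
        }
        return true;
    }
}
```

For the query "a b a" (offsets 0, 1, 2), doc positions 3, 4, 5 all give phrase position 3
and match. If both "a"s were placed on doc position 3, their phrase positions would be
3 and 1, so an exact match can never be declared with two terms on one doc position.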

However, once we add that sloppy-decaying frequency, we will match, at a certain posIndex,
different phrase-positions. This is because of the slop. So they might land on the same doc-position,
and then we start again...

This is really too bad. Sorry for the lengthy post; hopefully this will help when someone
wants to get into this.

Back to option 2.
                
> SloppyPhraseScorer sometimes misses documents that ExactPhraseScorer finds.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-3821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3821
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5, 4.0
>            Reporter: Naomi Dushay
>            Assignee: Doron Cohen
>         Attachments: LUCENE-3821-SloppyDecays.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821.patch, LUCENE-3821_test.patch, schema.xml, solrconfig-test.xml
>
>
> The general bug is a case where a phrase with no slop is found,
> but if you add slop its not.
> I committed a test today (TestSloppyPhraseQuery2) that actually triggers this case,
> jenkins just hasn't had enough time to chew on it.
> ant test -Dtestcase=TestSloppyPhraseQuery2 -Dtests.iter=100 is enough to make it fail on trunk or 3.x

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

