lucene-dev mailing list archives

From "Doron Cohen (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-697) Scorer.skipTo affects sloppyPhrase scoring
Date Fri, 27 Oct 2006 23:05:18 GMT
     [ http://issues.apache.org/jira/browse/LUCENE-697?page=all ]

Doron Cohen updated LUCENE-697:
-------------------------------

    Attachment: sloppy_phrase_skipTo.patch

This was tricky, for me anyhow, but I think I found it.

The difference in scoring between using next() and using skipTo() (or a combination of the
two) is caused by two (valid) orders of the sorted PhrasePositions.

Currently PhrasePositions sorting is defined by doc and position, where position already considers
the offset of the term within the (phrase) query. 

If, however, two PhrasePositions have the same doc and the same position, the sort makes no
decision between them, so either order is a valid sort (by the current sort definition). The
difference between next() and skipTo() in this regard is that skipTo() always calls sort(),
sorting the entire set, while next() only calls sort() at initialization and then maintains
the sorting as part of the scoring process.

This would be clearer with the following example - taken from Yonik's test case that is failing
now:
   - Doc1:     w1 w3 w2 w3 zz
   - Query:   "w3 w2"~2
When starting scoring in this doc, both PhrasePositions pp(w3) and pp(w2) have doc(w2)=doc(w3)=1.
Note that for the second w3 that matches we would have pos(w2)=2+1=3 and pos(w3)=3+0=3.


So, after scoring doc1("w3 w2"), if the sort result places pp(w2) at the top, we would also
score for doc1("w3 w2"). However, if pp(w3) is placed by the sort at the top (==smallest),
we would not also score for doc1("w3 w2").

Current behavior is inconsistent: skipTo() would take the two while next() won't, and I think
it is possible to create a case where it would be the other way around. So the behavior
should definitely be made consistent.

Next question to be asked is: Do we want to sum (or max) the frequency for both (or more cases)?
I think yes, sum. 

To fix this I am changing the PhrasePositions comparison, so that in case positions are equal,
the actual doc position (ignoring offset in query phrase) is considered.
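
As a rough illustration of the tie-break described above (not the code in the attached patch),
here is a minimal Java sketch. The Entry class, its field names and the ORDER comparator are
hypothetical stand-ins, not Lucene's PhrasePositions or its priority queue. The numbers are the
ones from the example above: both entries have doc=1 and sort position 3, with actual doc
positions 2 (w2) and 3 (w3).

```java
import java.util.Comparator;

public class PhrasePositionsOrder {
    /** Hypothetical stand-in for one PhrasePositions entry; not Lucene's class. */
    static final class Entry {
        final String term;
        final int doc;          // current document
        final int position;     // position used by the sort (already includes the query offset)
        final int docPosition;  // actual position of the term in the document

        Entry(String term, int doc, int position, int docPosition) {
            this.term = term;
            this.doc = doc;
            this.position = position;
            this.docPosition = docPosition;
        }
    }

    /** Order by doc, then by sort position, then break ties by actual doc position. */
    static final Comparator<Entry> ORDER =
        Comparator.<Entry>comparingInt(e -> e.doc)
                  .thenComparingInt(e -> e.position)
                  .thenComparingInt(e -> e.docPosition);

    public static void main(String[] args) {
        // Values taken from the example in this message (assumed, not read from an index).
        Entry w2 = new Entry("w2", 1, 3, 2);
        Entry w3 = new Entry("w3", 1, 3, 3);
        // Without the third key the comparator would return 0 and either order would be valid.
        System.out.println(ORDER.compare(w2, w3) < 0 ? "w2 first" : "w3 first");
    }
}
```

With the third key in place the order of tied entries no longer depends on whether the whole
set was re-sorted or maintained incrementally.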

In addition, I added missing calls to clear the priority queue before starting to sort and
to mark that no more initialization is required when skipTo() is called. 

I tested with the sequence that Yonik added:
    - skip skip next next skip skip 
And also with the sequences:
    - skip skip skip skip skip skip
    - next next next next next next 
    - skip next skip next skip next 
    - next skip next skip next skip
    - next next skip skip next next
The latter 5 cases are now commented out; the first case is in effect.
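
For readers who want to try such sequences outside the Lucene test suite, the following is a
small, self-contained Java sketch of the idea: run a sequence of skip/next steps and compare the
(doc, score) pairs with those from a pure next() traversal. MiniScorer, ArrayScorer and run()
are hypothetical names made up for this sketch, and the toy scorer is trivially consistent; the
assumption is that "skip" means skipTo(lastDoc + 1), which may differ from what the actual test
does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MixedTraversalCheck {
    /** Hypothetical minimal scorer interface, loosely modeled on next()/skipTo(). */
    interface MiniScorer {
        boolean next();
        boolean skipTo(int target);
        int doc();
        float score();
    }

    /** Toy array-backed scorer, used only to exercise the harness. */
    static final class ArrayScorer implements MiniScorer {
        private final int[] docs;
        private final float[] scores;
        private int idx = -1;

        ArrayScorer(int[] docs, float[] scores) { this.docs = docs; this.scores = scores; }

        public boolean next() { return ++idx < docs.length; }

        public boolean skipTo(int target) {
            // Always advance at least once, then stop at the first doc >= target.
            do { idx++; } while (idx < docs.length && docs[idx] < target);
            return idx < docs.length;
        }

        public int doc() { return docs[idx]; }
        public float score() { return scores[idx]; }
    }

    /** Runs a whitespace-separated sequence of "skip"/"next" steps, recording doc -> score. */
    static Map<Integer, Float> run(MiniScorer s, String ops) {
        Map<Integer, Float> seen = new LinkedHashMap<>();
        int last = -1;
        for (String op : ops.trim().split("\\s+")) {
            boolean more = op.equals("skip") ? s.skipTo(last + 1) : s.next();
            if (!more) break;
            last = s.doc();
            seen.put(last, s.score());
        }
        return seen;
    }

    public static void main(String[] args) {
        int[] docs = {0, 1, 2, 3, 4, 5};
        float[] scores = {0.5f, 1.0f, 0.25f, 2.0f, 0.75f, 1.5f};
        Map<Integer, Float> baseline =
            run(new ArrayScorer(docs, scores), "next next next next next next");
        String[] sequences = {
            "skip skip next next skip skip",
            "skip skip skip skip skip skip",
            "skip next skip next skip next",
            "next skip next skip next skip",
            "next next skip skip next next",
        };
        for (String seq : sequences) {
            Map<Integer, Float> got = run(new ArrayScorer(docs, scores), seq);
            System.out.println(seq + " -> " + (got.equals(baseline) ? "consistent" : "MISMATCH"));
        }
    }
}
```

Swapping ArrayScorer for an adapter around a real scorer would be the interesting case; that
adapter is not shown here.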

This scoring code still does not feel natural to me, so (actually as always) comments will
be appreciated.

- Doron

> Scorer.skipTo affects sloppyPhrase scoring
> ------------------------------------------
>
>                 Key: LUCENE-697
>                 URL: http://issues.apache.org/jira/browse/LUCENE-697
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.0.0
>            Reporter: Yonik Seeley
>         Assigned To: Doron Cohen
>         Attachments: sloppy_phrase_skipTo.patch
>
>
> If you mix skipTo() and next(), you get different scores than what is returned to a hit collector.
