Mailing-List: contact java-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-dev@lucene.apache.org
Message-ID: <17551634.1161990318264.JavaMail.root@brutus>
Date: Fri, 27 Oct 2006 16:05:18 -0700 (PDT)
From: "Doron Cohen (JIRA)" <jira@apache.org>
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-697) Scorer.skipTo affects sloppyPhrase
 scoring
In-Reply-To: <26397508.1161748396540.JavaMail.root@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

     [ http://issues.apache.org/jira/browse/LUCENE-697?page=all ]

Doron Cohen updated LUCENE-697:
-------------------------------

    Attachment: sloppy_phrase_skipTo.patch

This was tricky, for me anyhow, but I think I found it.

The difference in scoring between using next() and using skipTo() (or a combination of the two) is caused by two (valid) orders of the sorted PhrasePositions. 

Currently PhrasePositions sorting is defined by doc and position, where position already considers the offset of the term within the (phrase) query. 

If, however, two PhrasePositions have the same doc and the same position, the sort definition does not decide between them, so either order is a valid sort result (by the current sort definition). The difference between next() and skipTo() in this regard is that skipTo() always calls sort(), sorting the entire set, while next() calls sort() only at initialization and then maintains the order as part of the scoring process. 

This would be clearer with the following example - taken from Yonik's test case that is failing now:
   - Doc1:     w1 w3 w2 w3 zz
   - Query:   "w3 w2"~2
When starting scoring in this doc, both PhrasePositions pp(w3) and pp(w2) have doc(w2)=doc(w3)=1.
Note that for the second w3 that matches we would have pos(w2)=2+1=3 and pos(w3)=3+0=3. 

So, after scoring doc1("w3 w2"), if the sort result places pp(w2) at the top, we would also score for doc1("w3 w2"). However, if the sort places pp(w3) at the top (==smallest), we would not also score for doc1("w3 w2"). 
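To make the tie concrete, here is a minimal sketch using the numbers from the example above. Entry is a hypothetical stand-in for PhrasePositions (it is not the Lucene class), and the comparator only mirrors the sort as described in this message: by doc, then by the offset-adjusted position.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TieOrderDemo {
    // Hypothetical stand-in for PhrasePositions: term, doc, offset-adjusted position.
    static final class Entry {
        final String term;
        final int doc;
        final int position;

        Entry(String term, int doc, int position) {
            this.term = term;
            this.doc = doc;
            this.position = position;
        }
    }

    // Ordering as described for the current sort: doc first, then position.
    static final Comparator<Entry> BY_DOC_THEN_POSITION =
        Comparator.<Entry>comparingInt(e -> e.doc).thenComparingInt(e -> e.position);

    // True if no adjacent pair is out of order under the comparator.
    static boolean isSorted(List<Entry> entries) {
        for (int i = 1; i < entries.size(); i++) {
            if (BY_DOC_THEN_POSITION.compare(entries.get(i - 1), entries.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Values from the example: both entries are in doc 1 with position 3.
        Entry w2 = new Entry("w2", 1, 3);
        Entry w3 = new Entry("w3", 1, 3);
        System.out.println(isSorted(Arrays.asList(w2, w3)));
        System.out.println(isSorted(Arrays.asList(w3, w2)));
    }
}
```

Both orders pass the sortedness check, so which entry ends up at the top depends only on how the order was produced.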

Current behavior is inconsistent: skipTo() would take the two while next() won't, and I think it is possible to create a case where it would be the other way around. So the behavior should definitely be made consistent. 

The next question to be asked is: do we want to sum (or max) the frequency for both (or more) cases? I think yes, sum. 

To fix this I am changing the PhrasePositions comparison so that, in case positions are equal, the actual doc position (ignoring the offset in the query phrase) is considered. 
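A minimal sketch of such a comparison, not the patch itself: Entry and its field names are hypothetical, and ordering ties by ascending docPosition is an assumption made for illustration. The point is only that a third key makes the order independent of how the entries arrived.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TieBreakSketch {
    // Hypothetical stand-in for PhrasePositions.
    static final class Entry {
        final String term;
        final int doc;
        final int position;     // offset-adjusted position used for sorting
        final int docPosition;  // actual position of the term in the doc

        Entry(String term, int doc, int position, int docPosition) {
            this.term = term;
            this.doc = doc;
            this.position = position;
            this.docPosition = docPosition;
        }
    }

    // doc, then position, then (assumed: ascending) actual doc position.
    static final Comparator<Entry> ORDER =
        Comparator.<Entry>comparingInt(e -> e.doc)
            .thenComparingInt(e -> e.position)
            .thenComparingInt(e -> e.docPosition);

    // Sort a copy of the entries and return the term names in sorted order.
    static List<String> sortedTerms(Entry... entries) {
        List<Entry> list = new ArrayList<>(Arrays.asList(entries));
        list.sort(ORDER);
        List<String> terms = new ArrayList<>();
        for (Entry e : list) {
            terms.add(e.term);
        }
        return terms;
    }
}
```

With the example values (w2 at doc position 2, w3 at doc position 3, both with position 3), sortedTerms returns the same order whichever entry is passed first.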

In addition, I added missing calls to clear the priority queue before starting to sort and to mark that no more initialization is required when skipTo() is called. 
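The two additions can be pictured roughly as follows. This is only a sketch under assumptions: ResortSketch, its fields and its methods are hypothetical, and plain integers stand in for PhrasePositions. It shows a sort() that clears the queue before refilling it, and a skipTo-like method that marks initialization as done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class ResortSketch {
    private final PriorityQueue<Integer> queue = new PriorityQueue<>();
    private final List<Integer> current = new ArrayList<>();
    private boolean firstTime = true; // true until initialization has happened
    int sortCalls = 0;                // counts full sorts, for illustration only

    ResortSketch(List<Integer> initial) {
        current.addAll(initial);
    }

    // Rebuild the queue from scratch; clear() keeps entries from an
    // earlier sort from surviving into this one.
    private void sort() {
        queue.clear();
        queue.addAll(current);
        sortCalls++;
    }

    // skipTo-like step: replaces the current values and always re-sorts.
    void skipTo(List<Integer> newValues) {
        firstTime = false; // no further initialization needed after a skip
        current.clear();
        current.addAll(newValues);
        sort();
    }

    // next-like step: sorts only on first use.
    void next() {
        if (firstTime) {
            sort();
            firstTime = false;
        }
    }

    int queueSize() {
        return queue.size();
    }
}
```

Without the clear() the queue would grow on every skipTo-like call, and without resetting the flag a later next-like call would run the initial sort again.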

I tested with the sequence that Yonik added:
    - skip skip next next skip skip 
And also with the sequences:
    - skip skip skip skip skip skip
    - next next next next next next 
    - skip next skip next skip next 
    - next skip next skip next skip
    - next next skip skip next next
The latter 5 cases are now commented out; the first case is in effect.
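For readers who want to try this kind of check outside Lucene, here is a rough sketch of driving one scorer through such call sequences and comparing the hits. ToyScorer is a hypothetical stand-in over fixed (doc, score) pairs, not Lucene's Scorer, and the pattern strings ('s' for skip, 'n' for next) are an illustration, not the format used in the test case.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipNextSequences {
    // Hypothetical toy scorer over fixed, strictly increasing docs.
    static final class ToyScorer {
        private final int[] docs;
        private final float[] scores;
        private int idx = -1;

        ToyScorer(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        // Move to the next hit; false when exhausted.
        boolean next() {
            idx++;
            return idx < docs.length;
        }

        // Always advances at least once, then stops at the first doc >= target.
        boolean skipTo(int target) {
            do {
                idx++;
            } while (idx < docs.length && docs[idx] < target);
            return idx < docs.length;
        }

        int doc() { return docs[idx]; }
        float score() { return scores[idx]; }
    }

    // Walk all hits, choosing skipTo(lastDoc + 1) or next() per step
    // according to the pattern, which repeats cyclically.
    static List<String> run(String pattern, int[] docs, float[] scores) {
        ToyScorer scorer = new ToyScorer(docs, scores);
        List<String> hits = new ArrayList<>();
        int step = 0;
        int lastDoc = -1;
        while (true) {
            char op = pattern.charAt(step++ % pattern.length());
            boolean more = (op == 's') ? scorer.skipTo(lastDoc + 1) : scorer.next();
            if (!more) {
                break;
            }
            lastDoc = scorer.doc();
            hits.add(lastDoc + ":" + scorer.score());
        }
        return hits;
    }
}
```

A consistent scorer should return the same hits for "ssnnss", "ssssss", "nnnnnn", "snsnsn", "nsnsns" and "nnssnn"; any difference between the lists points at the kind of inconsistency described in this issue.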

This scoring code still does not feel natural to me, so (actually as always) comments will be appreciated.

- Doron

> Scorer.skipTo affects sloppyPhrase scoring
> ------------------------------------------
>
>                 Key: LUCENE-697
>                 URL: http://issues.apache.org/jira/browse/LUCENE-697
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.0.0
>            Reporter: Yonik Seeley
>         Assigned To: Doron Cohen
>         Attachments: sloppy_phrase_skipTo.patch
>
>
> If you mix skipTo() and next(), you get different scores than what is returned to a hit collector.
