lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Modifying score based on tf and slop
Date Sun, 05 Jul 2009 23:07:01 GMT

(Disclaimer: i'm not currently looking at the code, this email is entirely 
a guess based on what i remember about SpanQueries)

: II ) Using default implementation of tf in Similarity class:
: 
: Case 1 -  Doc : "AB BC BC CD"
: Result :  4  - Actual score
: % match :  ( actual score / max possible score) =  ( 4/3) > 100% - This is
: Wrong as I dont want score to be affected by no of times BC occurs

I suspect you are misunderstanding why you are getting the scores you are 
getting.

if i remember correctly, SpanNearQuery ignores all score information 
coming from the sub-queries it contains and only scores documents based on 
the distances of the matching Spans (this is true for all of the 
"container" span queries i believe, because they all use SpanScorer 
and it *only* looks at the Spans)

So i don't think anything in your SpanNearQuery is actually rewarding a 
doc for matching one of the individual terms more than once, because 
nothing ever looks at the tf() of the individual terms.  (if you use a 
custom Similarity, and override the tf(int) method to include some 
logging, i'm 90% certain you'll see that the method never gets called with 
any SpanQuery)

SpanScorer *does* look at every matching Span in a document however -- and 
assuming you are allowing slop (and it appears you are since other 
examples you list depend on it) the sequence "AB BC CD" exists twice in 
your example document above -- once using the BC at position 2, and once 
using the BC at position 3 -- hence the higher than (you) expected score.  
(if you use a custom Similarity, and override the tf(float) method to 
include some logging, i'm 90% certain you'll see that the method gets 
called twice for that span query against an index with only that document 
-- once per instance of the span.)
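
To make the "exists twice" point concrete, here is a toy sketch in plain 
Java (no Lucene classes).  It brute-forces every in-order alignment of the 
query terms whose total gap fits in the slop; that is *not* how Lucene's 
Spans implementations work internally, and a real implementation may 
report fewer alignments than this does.  The class name and the two 
formulas (1/(length+1) per span, square root of the sum) are assumptions 
modeled on what i remember of DefaultSimilarity, so treat the numbers as 
illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: brute-force alignment counting, not Lucene's span algorithm. */
class OverlappingSpansSketch {

    /** Returns every in-order alignment of terms in doc as a {start, end} pair (end exclusive). */
    static List<int[]> findSpans(String[] doc, String[] terms, int slop) {
        List<int[]> spans = new ArrayList<>();
        collect(doc, terms, slop, 0, -1, -1, 0, spans);
        return spans;
    }

    private static void collect(String[] doc, String[] terms, int slop, int termIdx,
                                int start, int prevPos, int gaps, List<int[]> out) {
        if (termIdx == terms.length) {
            out.add(new int[] {start, prevPos + 1});
            return;
        }
        for (int pos = prevPos + 1; pos < doc.length; pos++) {
            if (!doc[pos].equals(terms[termIdx])) {
                continue;
            }
            // Gap = number of positions skipped between consecutive query terms.
            int newGaps = (termIdx == 0) ? 0 : gaps + (pos - prevPos - 1);
            if (newGaps > slop) {
                break; // later positions can only widen the gap
            }
            int newStart = (termIdx == 0) ? pos : start;
            collect(doc, terms, slop, termIdx + 1, newStart, pos, newGaps, out);
        }
    }

    /** Assumed per-span contribution: shorter spans count for more. */
    static float sloppyFreq(int matchLength) {
        return 1.0f / (matchLength + 1);
    }

    /** Assumed dampening of the summed contributions. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Sums the contribution of every span handed in, then dampens the total. */
    static float score(List<int[]> spans) {
        float freq = 0f;
        for (int[] span : spans) {
            freq += sloppyFreq(span[1] - span[0]);
        }
        return tf(freq);
    }

    public static void main(String[] args) {
        String[] terms = {"AB", "BC", "CD"};
        List<int[]> spans = findSpans("AB BC BC CD".split(" "), terms, 1);
        System.out.println("alignments found: " + spans.size());
        System.out.println("score counting all of them: " + score(spans));
        System.out.println("score counting only one:    " + score(spans.subList(0, 1)));
    }
}
```

With a slop of 1 the toy finds two alignments for "AB BC BC CD", both 
covering positions 0 through 3, and counting both gives a larger value 
than counting one.  With a slop of 0 it finds none, since neither BC sits 
directly between AB and CD without a gap on one side.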

I'm fairly certain that finding overlapping spans is considered a 
"feature" of SpanQuery.  I suspect if you look through the test cases for 
SpanNearQuery you'll even find some examples just like yours where it 
requires that there be multiple matches.


looking at the online javadocs, i don't see any simple option to prevent 
overlapping spans when constructing the SpanNearQuery, but i think it 
would be fairly easy for you to subclass SpanNearQuery so it returns a new 
NearSpansNoOverlapping (a class you would write yourself) that wraps the 
NearSpansOrdered and only returns the "shortest" span from each document.

Incidentally: if you find subclassing SpanNearQuery too tedious for what 
you want, keep in mind that you don't have to use IndexSearcher and deal with 
the normal scoring (using sloppyFreq & tf) if you don't want to -- if you 
really just want to know about the lengths of instances of Spans in your 
index, you can call the getSpans method directly on your SpanNearQuery and 
iterate over them yourself, ignoring the ones you want to ignore.
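
Whether you wrap the spans or walk them yourself, the loop is roughly the 
same: visit each span, and keep only the shortest one per document.  The 
sketch below shows that loop in plain Java.  SpanCursor is a hypothetical 
stand-in i made up so the example runs without Lucene on the classpath; 
it only mirrors the next()/doc()/start()/end() style of iteration, and the 
sample data is invented.  With real Lucene you would obtain the spans from 
getSpans on your query instead.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch only: SpanCursor is a hypothetical stand-in, not a Lucene class. */
class ShortestSpanPerDocSketch {

    /** Hypothetical minimal cursor: call next() before reading doc/start/end. */
    interface SpanCursor {
        boolean next();
        int doc();
        int start();
        int end();
    }

    /** In-memory cursor over {doc, start, end} triples, for demonstration only. */
    static SpanCursor over(final int[][] spans) {
        return new SpanCursor() {
            private int i = -1;
            public boolean next() { return ++i < spans.length; }
            public int doc() { return spans[i][0]; }
            public int start() { return spans[i][1]; }
            public int end() { return spans[i][2]; }
        };
    }

    /** Walks every span and keeps only the shortest length seen for each document. */
    static Map<Integer, Integer> shortestLengthPerDoc(SpanCursor spans) {
        Map<Integer, Integer> best = new LinkedHashMap<>();
        while (spans.next()) {
            int length = spans.end() - spans.start();
            Integer seen = best.get(spans.doc());
            if (seen == null || length < seen) {
                best.put(spans.doc(), length);
            }
            // Any other span for this doc is simply ignored.
        }
        return best;
    }

    public static void main(String[] args) {
        // Invented data: doc 0 has two spans of length 4, doc 1 has lengths 3 and 5.
        int[][] data = { {0, 0, 4}, {0, 0, 4}, {1, 2, 5}, {1, 2, 7} };
        System.out.println(shortestLengthPerDoc(over(data))); // prints {0=4, 1=3}
    }
}
```

From the per-document lengths you can compute whatever "% match" you like, 
for example by comparing each length against the number of terms in the 
query, without tf or sloppyFreq ever getting involved.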



-Hoss



