lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Peter Keegan <peterlkee...@gmail.com>
Subject Re: sloppyFreq question
Date Mon, 09 Mar 2009 14:05:15 GMT
The reason I asked about Span scoring is that the behavior changed when I
switched from TermQuery to BoostingTermQuery to take advantage of payloads.

It seems to me that a SpanTermQuery and BoostingTermQuery should behave the
same as TermQuery with respect to term frequency. The 'edit distance' isn't
really relevant for these queries, is it?

For a SpanNearQuery that contains SpanTermQueries, the score for a match on
"the quick brown fox" would be lower than a match on "brown fox" because of
the edit distance (4 vs 2). This seems counter intuitive, too.

Any comments?

Thanks,
Peter


On Tue, Mar 3, 2009 at 2:42 PM, Peter Keegan <peterlkeegan@gmail.com> wrote:

> The DefaultSimilarity class defines sloppyFreq as:
>
> public float sloppyFreq(int distance) {
>   return 1.0f / (distance + 1);
> }
>
> For a 'SpanNearQuery', this reduces the effect of the term frequency on the
> score as the number of terms in the span increases. So, for a simple phrase
> query (using spans), the longer the phrase, the lower the TF. For a simple
> SpanTermQuery, the TF is reduced in half (1.0f / 1 + 1).
>
> I'm just wondering why this is the default behavior. For 'SpanTermQuery',
> I'd expect the TF to reflect the actual number of occurrences of the term.
> For a SpanNearQuery, wouldn't it still be the number of occurrences of the
> whole span, not the number of terms in the span?
>
> Thanks,
> Peter
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message