lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Philipp Nanz" <>
Subject Re: alternative scoring algorithm for PhraseQuery
Date Wed, 07 Mar 2007 17:12:14 GMT
Thanks for your answers. Your input is really appreciated :-)

@Paul Elschot:
Thanks for the hint. I guess I could use coord() to penalize missing
terms like this:

Query: a b c d
Doc A: a b c d => sloppyFreq(0) * coord(4, 4) = 1
Doc B: a b c => sloppyFreq(0) * coord(3, 4) = 0,75

Doc would score higher. I guess that might be a valid solution.

There is a drawback though, i.e. sloppyFreq(1) * coord(4, 4) = 0,5

So a perfect match with one insertion would score less than a 3 of 4
match with no slop.

As for spanqueries:
My implementation is based of the default PhraseQuery with slop > 0. I
don't know the inner workings of SpanQueries, but what you describe
sounds alot like what the PhraseQuery does as well (i.e. calculate max
distance between last and first term, and use that with sloppyFreq()).

I chose PhraseQuery as base of my work, because I felt that it would
offer better performance than firing off a plethora of spanqueries to
express the same query.

Long story short: My problem would generalize to spanqueries if
spanqueries would face the problem of deleted terms. But I guess they

@Chris Hostetter: You are absolutely right. But it shows off into
which direction it could go to. Perhaps I could add +1 (or some other
amount) as additional penalty to the maximum error for missing terms
to distinguish between these cases further.

But still this could lead to a case where

Doc A: a b c x1 x2 [more x...] xn d
will be scored lower than
Doc B: a b c
(because the distance of A can exceed the penalty for the missing term
- its only a matter of choosing the right n)
which is questionable as well.

2007/3/6, Chris Hostetter <>:
> : My initial idea was to penalize a missing term position with its maximum error.
> :
> : Consider this:
> : Query:  a b c d
> : Document A: b c d
> :
> : Term a is missing, score it as if it was at the worst position possible
> :
> : result:       b c d a
> : pos. diffs: -1 -1 -1 +3
> side comment: this doesn't sound very useful, a document containing "b c
> d" matches equally to a doc containing "b c d a" ? ... shouldn't a doc
> containing "b c d a" be considered a much better match since it at least
> contains all of the terms close together?
> -Hoss
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message