lucene-java-user mailing list archives

From "Philipp Nanz" <phili...@gmail.com>
Subject alternative scoring algorithm for PhraseQuery
Date Tue, 06 Mar 2007 00:08:05 GMT
Hello folks,

Maybe one of you can help me with this (sorry, long read).

I have implemented a FuzzyPhraseQuery that works similarly to Lucene's
native PhraseQuery.
That is, it can retrieve phrases for a query while tolerating insertions
and out-of-order terms.
In addition, it can also find matches with terms missing (deletions).

Scoring is implemented as described here:
http://www.gossamer-threads.com/lists/lucene/java-user/33558#33558

So the scorer uses the total error rather than the maximum error for
insertions and out-of-order terms. That part all works fine (even though the
total errors I'm observing quickly lead to very low frequencies
returned by sloppyFreq()).
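
To illustrate how quickly the frequency drops, here is a minimal sketch. It
assumes the scorer passes the total error to a function with the same shape
as DefaultSimilarity.sloppyFreq(int distance), i.e. 1 / (distance + 1). The
class name and the sample error values are made up for illustration.

```java
class SloppyFreqDecay {
    // Assumption: same formula as Lucene's DefaultSimilarity.sloppyFreq(int),
    // re-implemented here so the sketch runs without Lucene on the classpath.
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // Hypothetical total errors, not measured values.
        int[] totalErrors = {0, 1, 3, 6, 20};
        for (int error : totalErrors) {
            System.out.println("total error " + error
                    + " -> freq " + sloppyFreq(error));
        }
    }
}
```

Because the total error is summed over all terms, it grows with phrase
length, so long phrases end up with small frequencies even for modest
per-term displacements.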

Now my problem is with scoring the deletion cases.

My initial idea was to penalize a missing term position with its maximum error.

Consider this:

Query:  a b c d

Document A: b c d

Term a is missing, so score it as if it were at the worst possible position:

result:      b   c   d   a
pos. diffs: -1  -1  -1  +3
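
The example above can be reproduced with a small sketch. It is not the
actual FuzzyPhraseQuery scorer: the class and method names are hypothetical,
terms are assumed to be distinct, and missing terms are simply appended after
the last matched term, which is the worst position only when the missing term
comes from the start of the query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MissingTermPenalty {
    /**
     * Returns (document position - query position) for each query term,
     * in query order. Missing terms are appended after the matched terms
     * (an assumption made for this sketch).
     */
    static int[] positionDiffs(List<String> query, List<String> matched) {
        List<String> padded = new ArrayList<>(matched);
        for (String term : query) {
            if (!padded.contains(term)) {
                padded.add(term); // place the missing term at the end
            }
        }
        int[] diffs = new int[query.size()];
        for (int q = 0; q < query.size(); q++) {
            diffs[q] = padded.indexOf(query.get(q)) - q;
        }
        return diffs;
    }

    /** Total error: sum of the absolute position differences. */
    static int totalError(int[] diffs) {
        int sum = 0;
        for (int d : diffs) {
            sum += Math.abs(d);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("a", "b", "c", "d");
        List<String> document = Arrays.asList("b", "c", "d");
        int[] diffs = positionDiffs(query, document);
        System.out.println(Arrays.toString(diffs)
                + " total error = " + totalError(diffs));
    }
}
```

Note that the diffs are listed in query order (a, b, c, d) here, whereas the
example lists them in document order (b, c, d, a).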

It can be observed that the max error for the nth missing term is 2n - 2.
If you have a query with 100 terms and, say, 10 of them are not
found, I would have a penalty of 190 + 192 + 194, etc.

For extreme cases, this is rather simple to calculate. In the middle
of a phrase, things get tricky, though. Also, the penalty becomes higher
as the number of terms increases.

So I think this is not a viable solution for my problem.

Does anyone know a better solution for scoring deletion cases?

Thanks for your input,
Philipp
