Hello folks,
Maybe one of you can help me with this (sorry, long read).
I have implemented a FuzzyPhraseQuery that works similarly to Lucene's
native PhraseQuery.
I.e. it can retrieve phrases for a query, taking insertions and term
order into account.
In addition, it can also find matches with terms missing (deletions).
Scoring is implemented as described here:
http://www.gossamerthreads.com/lists/lucene/javauser/33558#33558
So the scorer uses the total error rather than the maximum error for
insertions and out-of-order terms. That part works fine (even though the
total errors I'm observing quickly lead to very low frequencies
returned by sloppyFreq()).
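
A minimal sketch of why the total error pushes the frequency down so quickly. It assumes the similarity uses the 1 / (distance + 1) formula that Lucene's DefaultSimilarity.sloppyFreq() implements, as far as I know. The class name, the helper methods and the position differences are made up for illustration.

```java
public class SloppyFreqSketch {

    // Assumption: same formula as Lucene's DefaultSimilarity.sloppyFreq(int distance).
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Total error: sum of the absolute position differences of all terms.
    static int totalError(int[] posDiffs) {
        int sum = 0;
        for (int d : posDiffs) {
            sum += Math.abs(d);
        }
        return sum;
    }

    // Maximum error: largest absolute position difference of any single term.
    static int maxError(int[] posDiffs) {
        int max = 0;
        for (int d : posDiffs) {
            max = Math.max(max, Math.abs(d));
        }
        return max;
    }

    public static void main(String[] args) {
        int[] posDiffs = {1, -2, 1}; // hypothetical match, not taken from a real index
        System.out.println("total error " + totalError(posDiffs)
                + " -> freq " + sloppyFreq(totalError(posDiffs)));
        System.out.println("max error   " + maxError(posDiffs)
                + " -> freq " + sloppyFreq(maxError(posDiffs)));
    }
}
```

Because the total error grows with every displaced term while the maximum error does not, the frequency under the total-error scheme shrinks faster as phrases get longer.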
Now my problem is with scoring the deletion cases.
My initial idea was to penalize a missing term position with its maximum error.
Consider this:
Query: a b c d
Document A: b c d
Term a is missing, score it as if it was at the worst position possible
result: b c d a
pos. diffs: -1 -1 -1 +3
It can be observed that the max error for the nth missing term is 2n - 2
If you have a query with 100 terms and, say, 10 of them are not
found, I would have a penalty of 190 + 192 + 194 etc.
For extreme cases, this is rather simple to calculate. In the middle
of a phrase, things get tricky though. Also, the penalty becomes higher
as the number of terms increases.
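
The worst-case penalty from the a b c d example can be sketched as follows. This is only an illustration under one assumption: n is read as the number of terms in the phrase, and the missing term is the first query term, placed behind all the others. The class and method names are hypothetical.

```java
public class DeletionPenaltySketch {

    // Closed form from the example: worst-case total error for a phrase of n terms.
    static int worstCaseClosedForm(int n) {
        return 2 * n - 2;
    }

    // Brute force: move the first query term behind all others and
    // sum the absolute position differences of every term.
    static int worstCaseBruteForce(int n) {
        int total = 0;
        for (int queryPos = 0; queryPos < n; queryPos++) {
            // Term 0 ends up at position n - 1; every other term shifts left by one.
            int docPos = (queryPos == 0) ? n - 1 : queryPos - 1;
            total += Math.abs(docPos - queryPos);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 10, 100}) {
            System.out.println(n + " terms: closed form " + worstCaseClosedForm(n)
                    + ", brute force " + worstCaseBruteForce(n));
        }
    }
}
```

The sketch only covers a term missing at the edge of the phrase. A term missing in the middle would need its own worst-position search, which is where the calculation gets tricky.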
So I think this is not a viable solution for my problem.
Does anyone know a better solution for scoring deletion cases?
Thanks for your input,
Philipp

