Another thought on fuzzy scoring:
shouldn't all these queries that automatically expand
terms favour common words over rare ones? The default
scoring behaviour at the moment favours rare words. As
a user, aren't I more likely to be looking for the most
common expansions?
If I'm not sure how to spell a word I might search for:
accomodation~
or
accom*
The fuzzy scoring algorithms will currently favour all
of the misspellings of accommodation in the ranking
of results because they are rarer.
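
To make the effect concrete, here is a rough sketch in Python (not Lucene code). It assumes the classic Lucene-style idf formula, 1 + ln(numDocs / (docFreq + 1)); the index size and the document frequencies are made-up numbers, purely for illustration.

```python
import math

def idf(doc_freq, num_docs):
    # Classic Lucene-style idf (assumed here): rarer terms get a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

NUM_DOCS = 100_000  # hypothetical index size

# Hypothetical document frequencies for the expansions of accomodation~
variant_df = {
    "accommodation": 5000,  # correct spelling, common
    "accomodation": 120,    # common misspelling
    "acommodation": 8,      # rare misspelling
}

# Order the variants the way idf-based weighting favours them.
ranked_by_idf = sorted(
    variant_df,
    key=lambda term: idf(variant_df[term], NUM_DOCS),
    reverse=True,
)

for term in ranked_by_idf:
    print(term, round(idf(variant_df[term], NUM_DOCS), 3))
```

With these numbers the rarest misspelling gets the largest weight and the correct spelling the smallest, which is the opposite of what I would want as a user.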
Ideally, within the expansions of a term the score
contribution should be based on df (document frequency)
rather than the usual idf (inverse document frequency),
BUT within the overall query the usual idf scheme
should still apply. To clarify:
If I search for:
the cheapest accomodation~ in london
I want to see the most common spellings of
accommodation before all other variants of this word
BUT I then want these variants scored against the
OTHER words ("in", "the" etc) on the usual basis of
rarity.
This suggests a sort order within another, different
sort order.
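
One way the two levels might be combined, sketched in Python rather than against the Lucene API: treat all expansions of a term as one group, give the group a single idf based on its combined df, then scale each variant by its df relative to the most common variant. The function names, the combination rule and all the numbers below are assumptions for illustration only, not an existing Lucene feature.

```python
import math

def idf(doc_freq, num_docs):
    # Classic Lucene-style idf (assumed here): rarer terms get a larger value.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def expansion_weights(variant_df, num_docs):
    """Hypothetical two-level weighting for the expansions of one term.

    Outer level: the whole group competes with other query terms on idf.
    Inner level: within the group, more common variants weigh more (df).
    """
    # Summing df over-counts documents that contain several variants;
    # it is used here only as a simple stand-in for the group's df.
    group_df = min(sum(variant_df.values()), num_docs)
    group_idf = idf(group_df, num_docs)
    max_df = max(variant_df.values())
    return {term: group_idf * (df / max_df) for term, df in variant_df.items()}

NUM_DOCS = 100_000  # hypothetical index size

# Hypothetical document frequencies for the non-expanded query terms
plain_terms = {"the": 95_000, "cheapest": 900, "in": 90_000, "london": 3_000}

# Hypothetical document frequencies for the expansions of accomodation~
variants = {"accommodation": 5000, "accomodation": 120, "acommodation": 8}

weights = {term: idf(df, NUM_DOCS) for term, df in plain_terms.items()}
weights.update(expansion_weights(variants, NUM_DOCS))

for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:15s} {weight:.3f}")
```

Here the common spelling outweighs its misspellings, while "cheapest" and "london" still outweigh "the" and "in" on rarity. Whether something like this fits cleanly into the existing scoring code is a separate question.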
This seems like it would not be easy to do. Any bright
ideas?
Cheers
Mark