lucene-dev mailing list archives

From Paul Elschot <paul.elsc...@xs4all.nl>
Subject Re: More fuzzy issues - encouraging bad spelling?
Date Thu, 23 Dec 2004 18:41:33 GMT
Mark,

On Thursday 23 December 2004 14:25, mark harwood wrote:
> Another thought on fuzzy scoring:
> shouldn't all these queries which automatically expand
> terms favour common words over rare ones? The default
> scoring behaviour at the moment favours rare words. As
> a user aren't I more likely to be looking for the most
> common expansions? 
> 
> If I'm not sure how to spell I might search for:
> accomodation~
> or
> accom*
> The fuzzy scoring algorithms will currently favour all
> of the mis-spellings of accommodation in the ranking
> of results because they are more rare.
> 
> Ideally within the expansions of a term the score
> contribution should be based on df (as opposed to the
> usual idf) BUT within the overall query the usual idf
> scheme applies. To clarify:
> If I search for:
>   the cheapest accomodation~ in london
> I want to see the most common spellings of
> accommodation before all other variants of this word
> BUT I then want these variants scored against the
> OTHER words ("in", "the" etc) on the usual basis of
> rarity.
>
> This suggests a sort order within another, different
> sort order.
> This seems like it would not be easy to do. Any bright
> ideas?

The brightest idea I have had so far is to drop the idf altogether.
Idf just doesn't seem to make much sense for terms related
through expansion as fuzzy terms or as truncated terms.

But since dropping idf is probably too controversial,
one solution that keeps idf is to use the minimum idf over
all the expanded terms.
Also, the within-document frequencies of the expanded terms
could be summed over these terms before applying tf(),
without a coordination factor as you suggested
in the previous post.
These three measures together would effectively treat
each expanded term as having equal value for scoring.

This would score the most common spellings the same as
the less common ones, instead of favouring the rare ones.
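
As a rough sketch of the three measures above (plain Python, not
Lucene code): the tf() and idf() formulas are assumed to follow the
shape of Lucene's default similarity, the real score has further
factors (norms, query normalisation) that are left out here, and
the document counts and function names are made up for illustration.

```python
import math

# Assumed formulas, modelled on Lucene's default similarity:
#   tf(freq) = sqrt(freq)
#   idf      = 1 + ln(num_docs / (doc_freq + 1))
def tf(freq):
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def per_term_score(term_freqs, doc_freq, num_docs):
    """Simplified current behaviour: every expanded term is scored
    with its own idf, and the contributions are summed."""
    return sum(tf(f) * idf(doc_freq[t], num_docs)
               for t, f in term_freqs.items() if f > 0)

def pooled_score(term_freqs, doc_freq, num_docs):
    """Sketch of the proposal: one shared (minimum) idf for all
    expanded terms, within-document frequencies summed before
    tf(), and no coordination factor."""
    shared_idf = min(idf(df, num_docs) for df in doc_freq.values())
    total_freq = sum(term_freqs.get(t, 0) for t in doc_freq)
    return tf(total_freq) * shared_idf

# Hypothetical index statistics: the correct spelling is common,
# the misspelling is rare.
NUM_DOCS = 10000
DOC_FREQ = {"accommodation": 999, "accomodation": 9}

# Two hypothetical documents, each containing one variant once.
doc_common = {"accommodation": 1}
doc_misspelt = {"accomodation": 1}

for name, doc in (("common", doc_common), ("misspelt", doc_misspelt)):
    print(name,
          per_term_score(doc, DOC_FREQ, NUM_DOCS),
          pooled_score(doc, DOC_FREQ, NUM_DOCS))
```

With per-term idf the document holding the misspelling gets the
higher score; with the pooled variant both documents score the
same, which is the "equal value" effect described above.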

Regards,
Paul Elschot



