David Spencer wrote:
> To restate the question for a second.
>
> The misspelled word is: "conts".
> The sugggestion expected is "const", which seems reasonable enough as
> it's just a transposition away, thus the string distance is low.
>
> But - I guess the problem w/ the algorithm is that for short words like
> this, with transpositions, the two words won't share many ngrams.
>
> Just looking at 3grams...
>
> conts -> con ont nts
> const -> con ons nst
>
> Thus they just share 1 3gram, thus this is why it scores so low. This is
> an interesting issue, how to tune the algorithm so that it might return
> words this close higher.
If you added 2-grams, then it would look like this (constructing also
special start/end grams):
conts -> _c co on nt ts s_
const -> _c co on ns st t_
which gives 50% of overlap.
In another system that I designed we were using a combination of 2-4
grams, albeit for a slightly different purpose, so in this case it would be:
conts:
_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
const:
_c co on ns st t_, _co con ons nst st_, _con cons onst nst_
and the overlap is 40%.
>
> I guess one way is to add all simple transpositions to the lookup table
> (the "ngram index") so that these could easily be found, with the
> heuristic that "a frequent way of misspelling words is to transpose two
> adjacent letters".
Yes, sounds like a good idea. Even though it increases the size of the
lookup index, it still beats using the linear search...
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
|