lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki ...@getopt.org>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Wed, 15 Sep 2004 18:29:50 GMT
David Spencer wrote:
> To restate the question for a second.
> 
> The misspelled word is: "conts".
> The sugggestion expected is "const", which seems reasonable enough as 
> it's just a transposition away, thus the string distance is low.
> 
> But - I guess the problem w/ the algorithm is that for short words like 
> this, with transpositions, the two words won't share many ngrams.
> 
> Just looking at 3grams...
> 
> conts -> con ont nts
> const -> con ons nst
> 
> Thus they just share 1 3gram, thus this is why it scores so low. This is 
> an interesting issue, how to tune the algorithm so that it might return 
> words this close higher.

If you added 2-grams, then it would look like this (constructing also 
special start/end grams):

conts -> _c co on nt ts s_
const -> _c co on ns st t_

which gives 50% of overlap.

In another system that I designed we were using a combination of 2-4 
grams, albeit for a slightly different purpose, so in this case it would be:

conts:
	_c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_

const:
	_c co on ns st t_, _co con ons nst st_, _con cons onst nst_

and the overlap is 40%.


> 
> I guess one way is to add all simple transpositions to the lookup table 
> (the "ngram index") so that these could easily be found, with the 
> heuristic that "a frequent way of misspelling words is to transpose two 
> adjacent letters".

Yes, sounds like a good idea. Even though it increases the size of the 
lookup index, it still beats using the linear search...

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message