lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: NGramSpeller contribution -- Re: combining open office spellchecker with Lucene
Date Wed, 15 Sep 2004 19:57:00 GMT
Andrzej Bialecki wrote:

> David Spencer wrote:
> 
>> To restate the question for a second.
>>
>> The misspelled word is: "conts".
>> The sugggestion expected is "const", which seems reasonable enough as 
>> it's just a transposition away, thus the string distance is low.
>>
>> But - I guess the problem w/ the algorithm is that for short words 
>> like this, with transpositions, the two words won't share many ngrams.
>>
>> Just looking at 3grams...
>>
>> conts -> con ont nts
>> const -> con ons nst
>>
>> Thus they just share 1 3gram, thus this is why it scores so low. This 
>> is an interesting issue, how to tune the algorithm so that it might 
>> return words this close higher.
> 
> 
> If you added 2-grams, then it would look like this (constructing also 
> special start/end grams):

Oh cute trick to indicate prefixes and suffixes.
Anyway, as per prev post I reformed index w/ ngrams from length 2 to 5, 
and also store transpositions, and w/ appropriate boosts :) then "const" 
is returned 2nd.

http://www.searchmorph.com/kat/spell.jsp?s=conts&min=2&max=5&maxd=5&maxr=10&bstart=2.0&bend=1.0&btranspose=10.0&popular=1

> 
> conts -> _c co on nt ts s_
> const -> _c co on ns st t_
> 
> which gives 50% of overlap.
> 
> In another system that I designed we were using a combination of 2-4 
> grams, albeit for a slightly different purpose, so in this case it would 
> be:
> 
> conts:
>     _c co on nt ts s_, _co con ont nts ts_, _con cont onts nts_
> 
> const:
>     _c co on ns st t_, _co con ons nst st_, _con cons onst nst_
> 
> and the overlap is 40%.
> 
> 
>>
>> I guess one way is to add all simple transpositions to the lookup 
>> table (the "ngram index") so that these could easily be found, with 
>> the heuristic that "a frequent way of misspelling words is to 
>> transpose two adjacent letters".
> 
> 
> Yes, sounds like a good idea. Even though it increases the size of the 
> lookup index, it still beats using the linear search...
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message