lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexey Lef <ale...@sciquest.com>
Subject RE: Spell checker
Date Wed, 20 Oct 2004 23:16:40 GMT
If you look at the FuzzyQuery code, it is based on computing Levenshtein
distance between the original term and every term in the index and keeping
the terms that are within the specified relative distance of the original
term. This would explain why FuzzyQuery may work well for small indexes but
for large indexes (I have ~5 million terms in mine) it is impossibly slow.

What n-gram based (or any other secondary index based) spell checkers are
trying to do is to select a limited number of candidate terms in a very
quick manner and then apply the distance algorithm to them. If you use the
same cutoff rules as the FuzzyQuery, you will get a very similar result set.
Secondary index-based spell checkers also give you a lot more control on how
many similar terms to bring back and in what order.

Regards,

Alexey


-----Original Message-----
From: Jonathan Hager [mailto:jhager@gmail.com] 
Sent: Wednesday, October 20, 2004 6:48 PM
To: Lucene Users List
Subject: Re: Spell checker


I investigated how the algorithm implemented in this spell checker
compares with my simple implementation of a spell checker.

First here is what my implementation looks like:

//Each word becomes a single Lucene Document

//To find suggestions:
   FuzzyQuery fquery = new FuzzyQuery(new Term("word", word));
   Hits dicthits = dictionarySearcher.search(fquery);

For a simple test I misspelled brown, as follows:
 * bronw
 * bruwn
 * brownz

To validate my testcases I checked if Microsoft Word and Google had
any idea what I was trying to spell.  Google suggested brown, brown,
browns, respectively.

Words suggestions were:

bronw==>brown, brow
bruwn==>brown, brawn, bruin
brownz==>browns, brown

The suggestions using  David Spencer/Nicolas Maisonneuve's algorithm
against my index were:

bronw==>jaron, brooks, citron, brookline
bruwn==>brush
brownz==>bronze, brooks, brooke, brookline


The suggestions using my real simple algorithm against my index were:

bronw==>brown, brwn, brush
bruwn==>brown, brwn, brush
brownz==>brown, bronze

It appears that  David Spencer/Nicolas Maisonneuve's Spell Checking
Algorithm returns a broader result set than most commercial algorithms
or a real simple algorithm.  I will be the first to say, that this is
just anecdotal evidence and not a rigourous test of either algorithm. 
But until extensive testing has been done I'm going to stick with my
real simple dictionary lookup.

Jonathan

On Wed, 20 Oct 2004 12:56:39 -0400, Aviran <amordo@infosciences.com> wrote:
> Here http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009
> 
> Aviran
> http://aviran.mordos.com
> 
> 
> 
> -----Original Message-----
> From: Lynn Li [mailto:lli@hoovers.com]
> Sent: Wednesday, October 20, 2004 10:52 AM
> To: 'Lucene Users List'
> Subject: RE: Spell checker
> 
> Where can I download it?
> 
> Thanks,
> Lynn
> 
> -----Original Message-----
> From: Nicolas Maisonneuve [mailto:nico.maisonneuve@free.fr]
> Sent: Monday, October 11, 2004 1:26 PM
> To: Lucene Users List
> Subject: Spell checker
> 
> hy lucene users
> i developed a Spell checker for lucene inspired by the David Spencer code
> 
> see the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker
> 
> Nicolas Maisonneuve
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message