lucene-java-user mailing list archives

From "Jon Crowell" <jcrow...@dsg.harvard.edu>
Subject RE: misspelled queries
Date Thu, 26 Jun 2003 23:05:40 GMT
ASpell is an open source spell checking tool with an API for
C. (I'm afraid I don't know of a C# spell checking API).
ASpell uses a very sophisticated algorithm that begins by
translating the offending word into its soundslike equivalent,
and yields the best results of any spell checking tool I am
aware of.  It is not completely dependent on soundex, however,
so even misspellings that are not close enough to yield the
same soundex code will get good results with ASpell.

http://aspell.sourceforge.net/

Upon coming across a misspelled word you could automatically
run the search using the top (or the top three) spelling
suggestions. Or you could just provide a couple of alternative
queries based on the spelling suggestions. If your dictionary
is exactly the set of terms in your index, then the suggestions
are guaranteed to yield hits with Lucene and will also very
likely be what the user had in mind (because ASpell is amazingly
good at finding the right word).
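
As a rough sketch of the "alternative queries" idea, here is some
Python. Note the assumptions: the standard library's difflib stands
in for ASpell (it ranks by string similarity, which is not ASpell's
algorithm), the function name alternative_queries and the sample
terms are made up for illustration, and the index terms are assumed
to be already lowercased and available as a plain list.

```python
import difflib


def alternative_queries(query, index_terms, max_suggestions=3):
    """Return rewritten queries for every query term missing from the index.

    index_terms is assumed to be the full list of indexed terms, so every
    suggestion is a term that will produce hits.
    """
    vocabulary = sorted(set(index_terms))  # sorted for a stable order
    known = set(vocabulary)
    words = query.split()
    alternatives = []
    for position, term in enumerate(words):
        if term in known:
            continue  # spelled like an indexed term, nothing to do
        # difflib is only a stand-in for a real spell checker here.
        for suggestion in difflib.get_close_matches(
            term, vocabulary, n=max_suggestions
        ):
            rewritten = list(words)
            rewritten[position] = suggestion
            alternatives.append(" ".join(rewritten))
    return alternatives


if __name__ == "__main__":
    terms = ["infraction", "parking", "ticket"]
    print(alternative_queries("parking nifraction", terms))
```

With those sample terms the only alternative produced is
"parking infraction", which you could either run automatically or
offer to the user.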

Now you say that you don't want to use a dictionary but you
do want to deal with misspellings. That seems difficult to me.
Also, relying only on the SoundEx code will leave you high and
dry every time someone makes a minor typo that messes up the
SoundEx code -- "nifraction" instead of "infraction", for
instance.
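
To make that concrete, here is a minimal sketch of the classic
American Soundex encoding in Python. It is my own illustration, not
the code from the CodeProject article or anything ASpell does, and
it assumes plain ASCII words (any other character is treated like a
vowel).

```python
# Letter groups and their digits in classic American Soundex.
SOUNDEX_CODES = {}
for group, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for letter in group:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code for word ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(letter, "")  # vowels map to ""
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code counts again
    # Keep the first letter, pad with zeros, truncate to four characters.
    return (first.upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for word in ("infraction", "nifraction", "spelling", "speling", "spellling"):
        print(word, soundex(word))
```

"infraction" encodes to I516 but "nifraction" encodes to N162, so a
Soundex-only lookup never connects the two. The spelling / speling /
spellling example from the earlier message does work (all three give
S145), because those typos leave the first letter and the consonant
order alone.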

Jon


> > GSpell is an open source java spell checking API.  It can
> > be found at 
> > http://umlslex.nlm.nih.gov/nlsRepository/gspell/doc/userDoc/
> >
> > It incorporates both metaphone (which is similar to 
> > SoundEx, I think) and ngram algorithms and it is easy to use.
> 
> That might be an option, but I'm using NLucene and C# so 
> porting a full java app is more solution than I'm looking for.
> 
> > I currently have an application in which a user submits a query
> > to Lucene and along the way I use GSpell to check all the terms
> > in the query.  If any are misspelled I underline with a squiggly
> > red line and provide spelling suggestions from GSpell if the
> > user right-clicks.
> >
> > If your spelling correction dictionary is exactly equal to 
> > the terms in your index then any misspelled word is also
> > guaranteed not to yield any hits, and any indexed term is
> > guaranteed not to turn up incorrectly spelled.
> 
> That's not quite what I wanted, actually.  I don't intend to 
> use a dictionary at all.  My hope is that the misspelling 
> should be close enough to the correct spelling that the 
> soundex code would be the same (i.e., spelling and speling 
> and spellling would all have the same soundex code).
> 
> > Jon
> 
> > >
> > > Hi,
> > >
> > > I've been thinking about trying to implement a misspelled or
> > > a similarity match, a la Google's "did you mean this ...".  I
> > > was thinking of using SoundEx or one of the newer algorithms
> > > to find appropriate suggestions.  To do this though I think
> > > I would need to enumerate every term in the index, not a
> > > practical solution I suppose.   Has anyone else attempted this
> > > or had any success with this idea?
> > >
> > > My only other idea would be to generate the SoundEx codes
> > > for every term as it's indexed and then add those codes to the
> > > index in a different field. (fyi, here's a link that explains
> > > SoundEx with example code: 
> > > http://www.codeproject.com/csharp/soundex.asp?target=soundex).
> > >
> > > Then the query would search the regular fields and then form
> > > a second soundex'd query and run it on the soundex field.
> > > Does this sound plausible? I'd be really interested to hear
> > > results if anyone has tried this before.
> > >
> > > Regards,
> > > Brian
> > > 
> >



