lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Geoff Hendrey" <ghend...@decarta.com>
Subject RE: double metaphone for misspellings
Date Thu, 18 Dec 2008 16:56:27 GMT
I would think that if the place names are English, which those in Boston
would be, then they would be reasonable candidates for soundex and
double metaphone. I am considering an approach where I store SOUNDEX,
refined SOUNDEX, doublemetaphone, and I'll look into ngram as well, and
search against each of those as separate fields, then combine the
results. I'd love to hear more about your experience.

-geoff

-----Original Message-----
From: Max Metral [mailto:max@artsalliancelabs.com] 
Sent: Thursday, December 18, 2008 4:14 AM
To: java-user@lucene.apache.org
Subject: RE: double metaphone for misspellings

Somehow I seem to have missed (and can't find) your original mail, but
it seems like you're asking about using double metaphone for place
names.  We've done this on our site (http://boston.povo.com) for street
and place names, and I can't say we've been happy with the results.
We're toying with ngrams instead, but neither is "well tuned" to the
particular types of mistakes people seem to make for place names, and I
haven't really found or figured out something that is.

-----Original Message-----
From: Daniel Noll [mailto:daniel@nuix.com]
Sent: Thursday, December 18, 2008 12:45 AM
To: java-user@lucene.apache.org
Subject: Re: double metaphone for misspellings

Geoff Hendrey wrote:
> ((POINameType)name).getText().split("\\s"); //tokenize manually.
(gosh,
> I thought the analyser would do this)

The analyser does do this... but related to this, the Right Way to do it

in your case would be to write your own analyser specifically for that
field, and do all the metaphone magic in the analyser.

You probably want an analyser which chains onto the StandardAnalyzer and

adds an additional token filter to do the double metaphone stuff. 
Writing a token filter isn't too hard, PorterStemFilter is a relatively
good example of doing something similar.

So you would end up with a DoubleMetaphoneFilter, which you could then
use with PerFieldAnalyzerWrapper to have it apply only to the fields you

use that for.

Daniel


-- 
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message