lucene-dev mailing list archives

From <karl.wri...@nokia.com>
Subject RE: LevenshteinFilter proposal
Date Mon, 26 Jul 2010 17:44:13 GMT
Clearly you haven’t been in the Northeast much. Try “Worcester” vs. “wuster”, or
“Leominster” vs. “leminster”. It’s also likely to be a challenge to come up with
the right phonetics for any given proper location name. It’s even worse in Britain, or
in countries where the phonetic rules may be a hodgepodge of different colonial influences.

That having been said, if there exists a “PhoneticQuery” object that does all this using
the automaton logic under the covers, I think it would be worth a serious look.

Karl


From: ext Robert Muir [mailto:rcmuir@gmail.com]
Sent: Monday, July 26, 2010 1:24 PM
To: dev@lucene.apache.org
Subject: Re: LevenshteinFilter proposal


On Mon, Jul 26, 2010 at 1:13 PM, <karl.wright@nokia.com> wrote:
> What I want to capture is situations where people misspell things in roughly a phonetic way.
> For example, “Tchaikovsky Avenue” might be misspelled as “Chicovsky Avenue”. Modules
> that do phonetic mapping are possible, but you’d have to somehow generate a phonetic database
> of (say) street names, worldwide. Good luck on getting hold of that kind of data anywhere.
> ;-) In the absence of such data, a Levenshtein distance (LD) match will have to do, but the
> allowed distance will almost certainly need to be greater than 2.
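
To put rough numbers on the "greater than 2" point, here is a minimal sketch in plain Python (not Lucene code) that computes the Levenshtein distance for the name pairs mentioned in this thread. The function name `levenshtein` and the choice to lowercase the inputs are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Name pairs taken from the thread, lowercased (an assumption) so case does not count.
pairs = [
    ("tchaikovsky", "chicovsky"),
    ("worcester", "wuster"),
    ("leominster", "leminster"),
]
for correct, misspelled in pairs:
    print(f"{correct} / {misspelled}: {levenshtein(correct, misspelled)}")
```

Under these assumptions the distances come out as 3, 4, and 1 respectively, so a maximum edit distance of 2 would miss the first two pairs.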
I added this to 'TestPhoneticFilter' and it passes:

    assertAlgorithm(new DoubleMetaphone(), false, "Tchaikovsky Chicovsky",
        new String[] { "XKFS", "XKFS" });

So if you want to give me all your street names, I can sell you a phonetic database, or you
can use the filters in modules/analyzers/phonetic, which have a bunch of different configurable
algorithms :)

--
Robert Muir
rcmuir@gmail.com