lucene-java-user mailing list archives

From John Volkar <jvol...@etransport.com>
Subject Ideas, Location Names
Date Wed, 06 Feb 2002 20:51:35 GMT
I'm looking for ideas on how to make searches for geographic location names
more usable.  Let me describe my scenario:  

I have a list of about 600,000 location names from around the world.  I can
of course search for exact matches, or use prefix queries (like "New Yor*"
to get "New Yorke" as well as "New York").
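As an illustration of the prefix query described above, here is a minimal
sketch outside of Lucene, assuming the names are lowercased and kept in a
sorted in-memory list.  The function name prefix_matches and the sentinel
character are assumptions made for this example only.

```python
import bisect

# Assumption: the highest Unicode code point works as an upper-bound sentinel,
# i.e. no real place name continues with this character after the prefix.
SENTINEL = "\U0010ffff"

def prefix_matches(sorted_names, prefix):
    """Return every name in a sorted list that starts with the given prefix."""
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = bisect.bisect_left(sorted_names, prefix + SENTINEL)
    return sorted_names[lo:hi]

names = sorted(n.lower() for n in ["New York", "New Yorke", "Newark", "Belem"])
matches = prefix_matches(names, "new yor")
```

With the sample list above, "new yor" selects "new york" and "new yorke" but
not "newark".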

Furthermore, I can run all the names through the Soundex and Double
Metaphone algorithms and store and index those strings as well, so I can
also apply those two algorithms to the user's input and search on the
results.
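To make the "encode at index time, encode again at query time" idea concrete,
here is a rough sketch of classic American Soundex.  It is only one common
variant of the algorithm, it silently drops non-ASCII letters (an assumption
that would need revisiting for world place names), and the function name is
chosen just for this example.

```python
# Letter groups used by classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return a 4-character Soundex key, or "" if the name has no ASCII letters."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = CODES.get(ch, "")  # vowels and y map to "" and reset prev
        if code and code != prev:
            encoded += code
        prev = code
        if len(encoded) == 4:
            break
    return encoded.ljust(4, "0")

# The same function is applied to indexed names and to user input.
same_key = soundex("New York") == soundex("New Yorke")
```

Two names are treated as phonetic candidates for each other when their keys
are equal, as "New York" and "New Yorke" are here.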

I suppose that I can also run all the names through a Porter stemmer and
store those strings too, then apply the same stemmer to the user's input and
search again.

Basically, I could derive 5 strings to search on and gather the results for
display.  So consider this done.  But is this going to be "good enough"?  I
don't know, so...
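The "derive several strings, search on each, gather the results" plan could
be sketched roughly as follows.  Everything here is an assumption made for
illustration: the helper names are hypothetical, only two of the five keys
are shown, and consonant_skeleton is a crude placeholder standing in for a
real phonetic encoder such as Soundex or Double Metaphone.

```python
from collections import Counter, defaultdict

def normalize(name):
    """Exact-match key: lowercase with collapsed whitespace."""
    return " ".join(name.lower().split())

def consonant_skeleton(name):
    """Placeholder phonetic key: first letter plus the remaining consonants."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    return letters[0] + "".join(c for c in letters[1:] if c not in "aeiouyhw")

# One entry per derived string; more encoders can be added here.
KEY_FUNCS = {"exact": normalize, "skeleton": consonant_skeleton}

def build_index(names):
    """Map (key type, derived key) to the set of names producing that key."""
    index = defaultdict(set)
    for name in names:
        for label, func in KEY_FUNCS.items():
            index[(label, func(name))].add(name)
    return index

def search(index, query):
    """Gather hits from every key type; names matching more keys rank first."""
    hits = Counter()
    for label, func in KEY_FUNCS.items():
        for name in index.get((label, func(query)), ()):
            hits[name] += 1
    return [n for n, _ in sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))]

index = build_index(["New York", "New Yorke", "Newark", "Belem"])
results = search(index, "new york")
```

Ranking by the number of agreeing keys is just one simple way to merge the
result sets; it puts the exact match ahead of the looser phonetic matches.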

My problem: imagine a location name that is hard to spell correctly, say
"Abaeteluba" (it's in Brazil).  Now imagine someone trying to look it up
given only a verbal pronunciation.  I'm hunting for various phonetic
encoding algorithms.  Do you have any, or know of any?  If you do, please
tell me about them.

Both Soundex and Double Metaphone are common, but both are optimized for
American English surnames.  The algorithms can of course be applied to any
sort of word, but their accuracy suffers.

I suppose I could use a spell-checking approach: keep a dictionary and
force the user to "spell-check" his input before searching, but...
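For what it's worth, the spell-check idea could look something like the
sketch below, which uses the list of place names itself as the dictionary
and Python's standard difflib module for approximate matching.  The function
name suggest and the cutoff value are assumptions for this example, and a
linear scan like this is unlikely to be fast enough for 600,000 names
without some pre-filtering.

```python
import difflib

def suggest(query, names, limit=3, cutoff=0.6):
    """Return up to `limit` place names whose spelling is close to the query."""
    # Compare in lowercase, but report the original spelling.
    by_lower = {name.lower(): name for name in names}
    close = difflib.get_close_matches(query.lower(), list(by_lower),
                                      n=limit, cutoff=cutoff)
    return [by_lower[key] for key in close]

names = ["Abaeteluba", "Belem", "Brasilia"]
suggestions = suggest("Abateluba", names)
```

Here the misspelled query "Abateluba" brings back "Abaeteluba" as the best
suggestion, which the user could then confirm before the real search runs.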

Any ideas?  I've seen references to a "k-stemmer" algorithm, but have not
found detailed info on it.  Again, any ideas are appreciated.

Thanks

John Volkar


PS: And yes, another search mechanism is to use geographic containment and
proximity (FIND "ab*" IN "Brazil" AND NEAR "Belem"), but right now I'm
looking for string-based lookups only.  (To do geographic proximity I need
lat/lon for all 600,000 place names, and I do not have that now.)



