lucene-java-user mailing list archives

From John Haxby <...@scalix.com>
Subject Re: Matching accented with non-accented characters
Date Tue, 25 Jul 2006 19:57:27 GMT
Rajan, Renuka wrote:
> I am trying to match accented characters with non-accented characters in French/Spanish
> and other Western European languages.  The use case is that the users may type letters without
> accents in error and we still want to be able to retrieve valid matches.  The one idea, albeit
> naïve, is to normalize the data on the inbound side as well as the data in the database (prior
> to full text indexing) and retrieve matches.
>
Look back through the archives a bit for ISOLatin1AccentFilter.  It
almost does the job and works reasonably well for Western European
characters.  You'll also find a posting of mine that presents a
somewhat more complete filter based on the Unicode decompositions.  If
you can't find it I'll dig out the stuff I wrote and re-post it (and
then maybe some kind soul will add it alongside ISOLatin1AccentFilter).
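
To give a rough idea of what "based on the Unicode decompositions" means,
here is a minimal sketch.  It is not the filter from the earlier posting and
not Lucene code: AccentFolder and fold() are made-up names, and it assumes a
JDK that ships java.text.Normalizer.  In a real analyzer the same folding
would be applied to each token's text inside a TokenFilter.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: folds accented letters to their
// base letters using the Unicode canonical decomposition (NFD).
final class AccentFolder {
    private AccentFolder() {}

    static String fold(String input) {
        // NFD splits a precomposed letter such as e-acute (U+00E9) into the
        // base letter "e" followed by the combining acute accent (U+0301).
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop every combining mark (Unicode category M); base letters stay.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Applied to both the indexed text and the query, this makes "Müller" and
"Muller" meet at the same term.  Letters that have no decomposition, such
as "æ" or "ø", pass through unchanged, so they still need a hand-written
mapping.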

Eric Jain's comment about "ä" being converted to "a" instead of "ae" is
a fair one, but it probably doesn't much matter.  Then again, I have seen
"Müller" written as both "Muller" and "Mueller", so you're not going to
be able to please everyone all the time without injecting synonyms and
being very clever.  And if you're that clever you might catch both
"encyclopedia" and "encyclopædia" -- the latter converted to
"encyclopaedia", which isn't the same as "encyclopëdia"!

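One way to picture the "injecting synonyms" idea is to produce every
plausible folded spelling of a word and index them all at the same position.
The sketch below is only an illustration under assumptions of my own:
FoldingVariants and variants() are made-up names, the "ae"/"oe"/"ue"
expansion is the German convention and would be wrong for other languages,
and the ligature table is deliberately tiny.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: returns the folded spellings
// that could be indexed as synonyms of one another.
final class FoldingVariants {
    private FoldingVariants() {}

    static Set<String> variants(String word) {
        // Compose first so the replacements below see single code points.
        String composed = Normalizer.normalize(word, Normalizer.Form.NFC);
        String expanded = expandLigatures(composed);
        Set<String> out = new LinkedHashSet<>();
        out.add(stripMarks(expanded));                 // plain accent stripping
        out.add(stripMarks(expandUmlauts(expanded)));  // German-style expansion
        return out;
    }

    // Ligatures have no Unicode decomposition, so they are mapped by hand:
    // U+00E6/U+00C6 (ae ligature) and U+0153/U+0152 (oe ligature).
    private static String expandLigatures(String s) {
        return s.replace("\u00e6", "ae").replace("\u00c6", "AE")
                .replace("\u0153", "oe").replace("\u0152", "OE");
    }

    // Assumed German convention: a/o/u with diaeresis become ae/oe/ue.
    private static String expandUmlauts(String s) {
        return s.replace("\u00e4", "ae").replace("\u00f6", "oe").replace("\u00fc", "ue")
                .replace("\u00c4", "Ae").replace("\u00d6", "Oe").replace("\u00dc", "Ue");
    }

    // Decompose, then drop combining marks (Unicode category M).
    private static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }
}
```

With this, "Müller" yields both "Muller" and "Mueller", while "encyclopædia"
yields only "encyclopaedia" and so still does not meet "encyclopedia".
Bridging that last gap takes a real synonym list rather than character
folding.
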
jch


