lucene-java-user mailing list archives

From Karl Wettin <>
Subject Re: string similarity measures
Date Thu, 04 Sep 2008 13:12:36 GMT

On 4 Sep 2008, at 14:38, Cam Bazz wrote:

> Hello,
> This came up before, but if we were to make a swear word filter, string
> edit distances are no good. For example, a word like `shot` is confused
> with `shit`. There is also a problem with words like hitchcock. Apparently
> I need something like Soundex or Double Metaphone. The thing is, these are
> language specific, and I am not operating in English.
> Simply put, I need a fuzzy curse word filter for Turkish.

You probably need to make a large list of words. I would try to learn
from the users that do swear, perhaps even trust my users to report
each other. I would probably also look at storing the context in which
the word is used, perhaps by adding the surrounding words (ngrams,
shingles, Markov chains). Compare "go to hell" and "when hell freezes
over". The first is rather derogatory, while the second doesn't have to
be bad at all.
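
As a rough illustration of the shingle idea, here is a minimal Python sketch (not Lucene code) that counts the word shingles surrounding a listed word in reported versus unreported messages. The function names, the shingle size of 3, and the sample messages are all assumptions made up for this example; a real filter would need proper tokenization for Turkish and far more data.

```python
from collections import Counter


def shingles(tokens, size):
    """Return all contiguous word n-grams (shingles) of the given size."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def context_counts(messages, word, size=3):
    """Count the shingles that contain `word` across a list of messages."""
    counts = Counter()
    for message in messages:
        # Naive whitespace tokenization; an assumption for this sketch only.
        tokens = message.lower().split()
        for shingle in shingles(tokens, size):
            if word in shingle:
                counts[shingle] += 1
    return counts


# Hypothetical data: messages users reported, and messages nobody reported.
reported = ["go to hell", "just go to hell already"]
harmless = ["when hell freezes over"]

bad = context_counts(reported, "hell")
ok = context_counts(harmless, "hell")
```

With these sample messages, the shingle ("go", "to", "hell") shows up only in the reported set, while ("when", "hell", "freezes") shows up only in the harmless one. Counts like these could serve as features for whatever model ends up making the decision.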

I'm thinking Hidden Markov Models and Neural Networks.

