lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cam Bazz" <>
Subject Re: string similarity measures
Date Thu, 04 Sep 2008 13:54:14 GMT
yes, I already have a system for users reporting words. they fall on an
operator screen and if operator approves, or if 3 other people marked it as
curse, then it is filtered.
in the other thread you wrote:

>I would create 1-5 ngram sized shingles and measure the distance using
Tanimoto coefficient. That would probably work out just fine. ?>You might
want to add more weight the greater the size of the shingle.
>There are shingle filters in lucene/java/contrib/analyzers and there is a
Tanimoto distance in lucene/mahout/.

would that apply to my case? tanimoto coefficient over shingles?


On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <> wrote:

> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>  Hello,
>> This came up before but - if we were to make a swear word filter, string
>> edit distances are no good. for example words like `shot` is confused with
>> `shit`. there is also problem with words like hitchcock. appearently i
>> need
>> something like soundex or double metaphone. the thing is - these are
>> language specific, and i am not operating in english.
>> I need a fuzzy like curse word filter for turkish, simply.
> You probably need to make a large list of words. I would try to learn from
> the users that do swear, perhaps even trust my users to report each other. I
> would probably also look at storing in what context the word is used,
> perhaps by adding the surrounding words (ngrams, shingles, markov chains).
> Compare "go to hell" and "when hell frezes over". The first is rather
> derogatory while the second doen't have to be bad at all.
> I'm thinking Hidden Markov Models and Neural Networks.
>          karl
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message