lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Cam Bazz" <camb...@gmail.com>
Subject Re: string similarity measures
Date Thu, 04 Sep 2008 14:07:09 GMT
let me rephrase the problem. I already have a set of bad words. I want to
avoid people inputting typos of the bad words.
for example 'shit' is banned, but someone may enter sh1t.

how can i flag those phonetically similar bad words to the marked bad words?

Best.

On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <karl.wettin@gmail.com> wrote:

>
> 4 sep 2008 kl. 15.54 skrev Cam Bazz:
>
>  yes, I already have a system for users reporting words. they fall on an
>> operator screen and if operator approves, or if 3 other people marked it
>> as
>> curse, then it is filtered.
>> in the other thread you wrote:
>>
>>  I would create 1-5 ngram sized shingles and measure the distance using
>>>
>> Tanimoto coefficient. That would probably work out just fine. ?>You might
>> want to add more weight the greater the size of the shingle.
>>
>>>
>>> There are shingle filters in lucene/java/contrib/analyzers and there is a
>>>
>> Tanimoto distance in lucene/mahout/.
>>
>> would that apply to my case? tanimoto coefficient over shingles?
>>
>
> Not really, no.
>
>
>     karl
>
>
>
>
>>
>> Best,
>>
>>
>> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <karl.wettin@gmail.com>
>> wrote:
>>
>>
>>> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>>>
>>>
>>> Hello,
>>>
>>>> This came up before but - if we were to make a swear word filter, string
>>>> edit distances are no good. for example words like `shot` is confused
>>>> with
>>>> `shit`. there is also problem with words like hitchcock. appearently i
>>>> need
>>>> something like soundex or double metaphone. the thing is - these are
>>>> language specific, and i am not operating in english.
>>>>
>>>> I need a fuzzy like curse word filter for turkish, simply.
>>>>
>>>>
>>> You probably need to make a large list of words. I would try to learn
>>> from
>>> the users that do swear, perhaps even trust my users to report each
>>> other. I
>>> would probably also look at storing in what context the word is used,
>>> perhaps by adding the surrounding words (ngrams, shingles, markov
>>> chains).
>>> Compare "go to hell" and "when hell frezes over". The first is rather
>>> derogatory while the second doen't have to be bad at all.
>>>
>>> I'm thinking Hidden Markov Models and Neural Networks.
>>>
>>>
>>>        karl
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message