lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mathieu <math...@garambrogne.net>
Subject Re: string similarity measures
Date Thu, 04 Sep 2008 14:55:25 GMT

I submitted a patch to handle Aspell phonetic rules. You can find it in JIRA.

On Thu, 4 Sep 2008 17:07:09 +0300, "Cam Bazz" <cambazz@gmail.com> wrote:
> let me rephrase the problem. I already have a set of bad words. I want to
> avoid people inputting typos of the bad words.
> for example 'shit' is banned, but someone may enter sh1t.
> 
> how can i flag those phonetically similar bad words to the marked bad
> words?
> 
> Best.
> 
> On Thu, Sep 4, 2008 at 5:02 PM, Karl Wettin <karl.wettin@gmail.com> wrote:
> 
>>
>> 4 sep 2008 kl. 15.54 skrev Cam Bazz:
>>
>>  yes, I already have a system for users reporting words. they fall on an
>>> operator screen and if operator approves, or if 3 other people marked
> it
>>> as
>>> curse, then it is filtered.
>>> in the other thread you wrote:
>>>
>>>  I would create 1-5 ngram sized shingles and measure the distance using
>>>>
>>> Tanimoto coefficient. That would probably work out just fine. ?>You
> might
>>> want to add more weight the greater the size of the shingle.
>>>
>>>>
>>>> There are shingle filters in lucene/java/contrib/analyzers and there
> is a
>>>>
>>> Tanimoto distance in lucene/mahout/.
>>>
>>> would that apply to my case? tanimoto coefficient over shingles?
>>>
>>
>> Not really, no.
>>
>>
>>     karl
>>
>>
>>
>>
>>>
>>> Best,
>>>
>>>
>>> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <karl.wettin@gmail.com>
>>> wrote:
>>>
>>>
>>>> 4 sep 2008 kl. 14.38 skrev Cam Bazz:
>>>>
>>>>
>>>> Hello,
>>>>
>>>>> This came up before but - if we were to make a swear word filter,
> string
>>>>> edit distances are no good. for example words like `shot` is confused
>>>>> with
>>>>> `shit`. there is also problem with words like hitchcock. appearently
> i
>>>>> need
>>>>> something like soundex or double metaphone. the thing is - these are
>>>>> language specific, and i am not operating in english.
>>>>>
>>>>> I need a fuzzy like curse word filter for turkish, simply.
>>>>>
>>>>>
>>>> You probably need to make a large list of words. I would try to learn
>>>> from
>>>> the users that do swear, perhaps even trust my users to report each
>>>> other. I
>>>> would probably also look at storing in what context the word is used,
>>>> perhaps by adding the surrounding words (ngrams, shingles, markov
>>>> chains).
>>>> Compare "go to hell" and "when hell frezes over". The first is rather
>>>> derogatory while the second doen't have to be bad at all.
>>>>
>>>> I'm thinking Hidden Markov Models and Neural Networks.
>>>>
>>>>
>>>>        karl
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message