lucene-java-user mailing list archives

From Karl Wettin <karl.wet...@gmail.com>
Subject Re: string similarity measures
Date Thu, 04 Sep 2008 14:02:33 GMT

On 4 Sep 2008, at 15:54, Cam Bazz wrote:

> Yes, I already have a system for users reporting words. They land on an
> operator screen, and if the operator approves, or if 3 other people have
> marked it as a curse, then it is filtered.
> In the other thread you wrote:
>
>> I would create 1-5 ngram sized shingles and measure the distance using
>> Tanimoto coefficient. That would probably work out just fine. You might
>> want to add more weight the greater the size of the shingle.
>>
>> There are shingle filters in lucene/java/contrib/analyzers and there is a
>> Tanimoto distance in lucene/mahout/.
>
> Would that apply to my case? Tanimoto coefficient over shingles?

Not really, no.
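
For reference, here is a rough sketch of what the quoted suggestion computes.
It is plain Python rather than the Lucene shingle filters or the Mahout
distance class, and the names shingles() and weighted_tanimoto() as well as
the choice of "weight = shingle length" are assumptions made up for this
illustration. The measure only looks at surface overlap between character
shingles, so words such as `shot` and `shit` still share several shingles.

```python
def shingles(word, min_n=1, max_n=5):
    """Return the set of character n-grams of `word` for n in [min_n, max_n]."""
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }


def weighted_tanimoto(a, b, weight=len):
    """Tanimoto coefficient over the shingle sets of `a` and `b`.

    Each shingle counts with weight(shingle); the default (its length) is an
    assumed way of giving longer shingles more weight.
    """
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    if not union:
        return 1.0  # two empty strings: treat them as identical
    shared = sum(weight(s) for s in sa & sb)
    total = sum(weight(s) for s in union)
    return shared / total


if __name__ == "__main__":
    for left, right in [("shit", "shit"), ("shot", "shit"), ("shot", "hitchcock")]:
        print(left, right, round(weighted_tanimoto(left, right), 3))
```

Passing weight=lambda s: 1 gives the plain, unweighted coefficient.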


      karl


>
>
> Best,
>
>
> On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin <karl.wettin@gmail.com> wrote:
>
>>
>> On 4 Sep 2008, at 14:38, Cam Bazz wrote:
>>
>>
>>> Hello,
>>> This came up before, but if we were to make a swear word filter, string
>>> edit distances are no good. For example, a word like `shot` is confused
>>> with `shit`. There is also a problem with words like hitchcock. Apparently
>>> I need something like Soundex or Double Metaphone. The thing is, these are
>>> language specific, and I am not operating in English.
>>>
>>> I simply need a fuzzy curse word filter for Turkish.
>>>
>>>
>>
>> You probably need to make a large list of words. I would try to learn from
>> the users that do swear, perhaps even trust my users to report each other. I
>> would probably also look at storing in what context the word is used,
>> perhaps by adding the surrounding words (ngrams, shingles, Markov chains).
>> Compare "go to hell" and "when hell freezes over". The first is rather
>> derogatory while the second doesn't have to be bad at all.
>>
>> I'm thinking Hidden Markov Models and Neural Networks.
>>
>>
>>         karl
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

