Message-ID: <1bcb7c7f0809040654u577fd444g2f3dea32c4f1adcd@mail.gmail.com>
Date: Thu, 4 Sep 2008 16:54:14 +0300
From: "Cam Bazz"
To: java-user@lucene.apache.org
Subject: Re: string similarity measures
References: <1bcb7c7f0809040538s681fb81aud4e3e2aa19435caf@mail.gmail.com>

Yes, I already have a system for users to report words. Reported words
show up on an operator screen, and a word is filtered if the operator
approves it or if 3 other people have marked it as a curse.

In the other thread you wrote:

> I would create 1-5 ngram sized shingles and measure the distance using
> Tanimoto coefficient. That would probably work out just fine.
> You might want to add more weight the greater the size of the shingle.
>
> There are shingle filters in lucene/java/contrib/analyzers and there is
> a Tanimoto distance in lucene/mahout/.

Would that apply to my case? Tanimoto coefficient over shingles?

Best,

On Thu, Sep 4, 2008 at 4:12 PM, Karl Wettin wrote:

> On 4 Sep 2008, at 14:38, Cam Bazz wrote:
>
>> Hello,
>> This came up before, but if we were to make a swear word filter,
>> string edit distances are no good. For example, words like `shot` are
>> confused with `shit`. There is also a problem with words like
>> Hitchcock. Apparently I need something like Soundex or Double
>> Metaphone.
>> The thing is, these are language specific, and I am not operating in
>> English.
>>
>> I need a fuzzy-like curse word filter for Turkish, simply.
>
> You probably need to make a large list of words. I would try to learn
> from the users that do swear, perhaps even trust my users to report
> each other. I would probably also look at storing in what context the
> word is used, perhaps by adding the surrounding words (ngrams,
> shingles, markov chains). Compare "go to hell" and "when hell freezes
> over". The first is rather derogatory while the second doesn't have to
> be bad at all.
>
> I'm thinking Hidden Markov Models and Neural Networks.
>
>
> karl
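For reference, the suggestion quoted from the other thread (1 to 5 sized shingles compared with a Tanimoto coefficient, with larger shingles weighted more) could look roughly like the sketch below. It is only an illustration under several assumptions: shingles are taken as character n-grams of a single word, the weight of a shingle is simply its length (the thread does not say how much extra weight to give), and the coefficient is the multiset generalization (sum of minima over sum of maxima). The names `shingles` and `weighted_tanimoto` are hypothetical and are not the Lucene or Mahout APIs.

```python
from collections import Counter


def shingles(word, min_n=1, max_n=5):
    """Count every character n-gram of length min_n..max_n in word.

    Assumes the input is already case-normalized; Turkish casing
    (I/ı, İ/i) needs locale-aware handling that is out of scope here.
    """
    grams = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams


def weighted_tanimoto(a, b, min_n=1, max_n=5):
    """Length-weighted Tanimoto coefficient over two shingle multisets.

    Assumption: weight = len(shingle), so longer shared shingles count more.
    Returns a value in [0, 1]; 1.0 means identical shingle multisets.
    """
    ga = shingles(a, min_n, max_n)
    gb = shingles(b, min_n, max_n)
    # Shared mass: for each common shingle, the smaller of the two counts.
    inter = sum(len(g) * min(ga[g], gb[g]) for g in ga.keys() & gb.keys())
    # Total mass: for each shingle seen anywhere, the larger of the two counts.
    union = sum(len(g) * max(ga[g], gb[g]) for g in ga.keys() | gb.keys())
    # Two empty inputs are treated as identical.
    return inter / union if union else 1.0


if __name__ == "__main__":
    for pair in [("shit", "shit"), ("shot", "shit"), ("abc", "xyz")]:
        print(pair, round(weighted_tanimoto(*pair), 3))
```

With this weighting, `shot` and `shit` share only the short shingles `s`, `h`, `t` and `sh`, so their score stays low even though their edit distance is 1. Any cut-off for "similar enough" would have to be tuned on real data.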