lucene-java-user mailing list archives

From "Jason Polites" <jason.poli...@tpg.com.au>
Subject Re: Inappropriate content detection
Date Mon, 06 Feb 2006 21:31:29 GMT
There is also an open source Java anti-spam API which does a Bayesian scan of
email content (plus other stuff).

You could retrofit it to work with raw text.

www.jasen.org

(get the latest HEAD from CVS as the current release is a bit old... new
version imminent)
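
As a rough illustration of what a Bayesian scan of raw text involves, here is
a minimal naive Bayes sketch in plain Java. It is not the code behind
www.jasen.org or spambayes; the class name NaiveBayesSketch, its methods and
the simple tokenizer are all hypothetical, and a real filter would need far
more training data than the handful of posts a quick test uses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical toy classifier: two-class naive Bayes with add-one smoothing.
final class NaiveBayesSketch {
    private final Map<String, Integer> badCounts = new HashMap<>();
    private final Map<String, Integer> okCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int badTokens, okTokens, badDocs, okDocs;

    // Lower-case the text and split on runs of non-letter, non-digit characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // "Teach" the filter with one hand-labelled post.
    void train(String text, boolean inappropriate) {
        Map<String, Integer> counts = inappropriate ? badCounts : okCounts;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            vocabulary.add(token);
            if (inappropriate) badTokens++; else okTokens++;
        }
        if (inappropriate) badDocs++; else okDocs++;
    }

    // Log-odds that the text is inappropriate; positive means "more likely bad".
    double logOdds(String text) {
        int v = vocabulary.size();
        double score = Math.log((badDocs + 1.0) / (okDocs + 1.0));
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            double pBad = (badCounts.getOrDefault(token, 0) + 1.0) / (badTokens + v);
            double pOk = (okCounts.getOrDefault(token, 0) + 1.0) / (okTokens + v);
            score += Math.log(pBad / pOk);
        }
        return score;
    }

    boolean isInappropriate(String text) {
        return logOdds(text) > 0.0;
    }
}
```

You would call train(text, true) or train(text, false) for each post a
moderator has labelled, then isInappropriate(text) on each new post. The
cut-off at zero log-odds is an arbitrary choice for the sketch; in practice
you would tune it against posts you have already reviewed.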

----- Original Message ----- 
From: "Gwyn Carwardine" <gwyn@carwardine.net>
To: <java-user@lucene.apache.org>
Sent: Tuesday, February 07, 2006 12:58 AM
Subject: RE: Inappropriate content detection


> The good bit about Bayesian is that it continuously learns.
>
> The downside is that you have to teach it.
>
> Not quite as simple as a list of rude words.
>
> There's an open source Bayesian mail filter called spambayes
> (http://spambayes.sourceforge.net) which may lead you to interesting
> places.
>
> -Gwyn
>
> -----Original Message-----
> From: Jeff Thorne [mailto:jeff_thorne@yahoo.com]
> Sent: 06 February 2006 13:30
> To: java-user@lucene.apache.org
> Subject: RE: Inappropriate content detection
>
> The site will have a million+ posts. I am not familiar with Bayesian
> algorithms. Is there an off-the-shelf API that can provide this type of
> capability? As for performance, would Bayesian be the way to go over
> Lucene?
>
> Thanks for the help,
> Jeff
>
> -----Original Message-----
> From: gekkokid [mailto:me@gekkokid.org.uk]
> Sent: Sunday, February 05, 2006 8:40 PM
> To: java-user@lucene.apache.org
> Subject: Re: Inappropriate content detection
>
> Hi, what scale is this website? Millions of posts or fewer?
>
> Wouldn't it be easier to use a Bayesian algorithm to scan each new post
> before it is posted to detect whether it is acceptable or not? Just a quick
> idea off the top of my head.
>
>
>
> _gk
>
> ----- Original Message ----- 
> From: "Jeff Thorne" <jeff_thorne@yahoo.com>
> To: <java-user@lucene.apache.org>
> Sent: Monday, February 06, 2006 3:56 AM
> Subject: Inappropriate content detection
>
>
>> I am trying to figure out whether or not Lucene is an appropriate solution
>> for a problem that our site faces. Our site allows users to post their
>> opinions on various topics. Due to government legislation around the
>> world, our management would like us to scan each user's post against
>> keywords that would indicate inappropriate content in the posting. We are
>> looking for racial slurs, profanity and attacks against sexual
>> orientation. Each user's posting is generally no more than a few
>> paragraphs.
>>
>>
>>
>> I would like to analyze each user's post for various words and expressions
>> before publishing the post to the DB. I am reading through the Lucene in
>> Action book and it looks as if I cannot analyze a string without first
>> indexing it. If this is true, will indexing each post be a performance hit
>> to the site? I was wondering if someone could shed some light on the best
>> way to tackle this problem with Lucene, or with another API if doing so
>> makes more sense.
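
For a plain keyword check, a post of a few paragraphs can be scanned in
memory without building any index. The sketch below is a minimal example in
plain Java under that assumption; the class KeywordScanSketch and its method
findBlocked are hypothetical names, and the tokenizer is a simple regex split
rather than a Lucene analyzer. (Lucene analyzers can also tokenize a string
directly through Analyzer.tokenStream, but the exact calls differ between
Lucene versions, so that route is not shown here.)

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: checks a post against a fixed list of blocked words.
final class KeywordScanSketch {
    private final Set<String> blocked = new HashSet<>();

    KeywordScanSketch(String... words) {
        for (String w : words) {
            blocked.add(w.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the blocked words found in the post, in order of first appearance.
    Set<String> findBlocked(String post) {
        Set<String> hits = new LinkedHashSet<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String token : post.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (blocked.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

Matching whole tokens rather than substrings avoids flagging innocent words
that merely contain a blocked word, at the cost of missing deliberate
misspellings and multi-word expressions, which would need extra handling.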
>>
>>
>>
>> Thanks,
>>
>> Jeff
>>
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org



