lucene-java-user mailing list archives

From "Gwyn Carwardine" <g...@carwardine.net>
Subject RE: Inappropriate content detection
Date Mon, 06 Feb 2006 13:58:31 GMT
The good bit about Bayesian filtering is that it continuously learns.

The downside is that you have to teach it.

Not quite as simple as a list of rude words. 

There's an open source Bayesian mail filter called spambayes
(http://spambayes.sourceforge.net) which may lead you to interesting places.
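
To make the "you have to teach it" part concrete, here is a minimal naive
Bayes sketch in plain Java. It is only an illustration and makes no claim
about how spambayes works internally. The class name NaiveBayesSketch, the
labels "ok" and "flagged", and the simple tokenizer are all assumptions made
up for this example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of any library.
class NaiveBayesSketch {
    // Word counts per label, total words per label, and posts per label.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Assumed tokenizer: lower-case, split on anything that is not a-z, 0-9 or '.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+");
    }

    // "Teaching" step: record one example post under a human-assigned label.
    void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    // Returns the most probable label, or null if nothing has been trained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // Log prior: fraction of training posts carrying this label.
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = totalWords.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int count = counts.getOrDefault(word, 0);
                // Add-one (Laplace) smoothing so unseen words do not zero the score.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

You would call train() with posts a moderator has already labelled, then
classify() on each new post. Every moderator correction can be fed back
through train(), which is where the continuous learning comes from.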

-Gwyn

-----Original Message-----
From: Jeff Thorne [mailto:jeff_thorne@yahoo.com] 
Sent: 06 February 2006 13:30
To: java-user@lucene.apache.org
Subject: RE: Inappropriate content detection

The site will have a million+ posts. I am not familiar with Bayesian
algorithms. Is there an off-the-shelf API that can provide this type of
capability? As for performance, would Bayesian be the way to go over Lucene?

Thanks for the help,
Jeff

-----Original Message-----
From: gekkokid [mailto:me@gekkokid.org.uk] 
Sent: Sunday, February 05, 2006 8:40 PM
To: java-user@lucene.apache.org
Subject: Re: Inappropriate content detection

Hi, what scale is this website? Millions of posts or under?

Wouldn't it be easier to use a Bayesian algorithm to scan each new post 
before it is posted to detect whether it is acceptable or not? Just a quick 
idea off the top of my head.



_gk

----- Original Message ----- 
From: "Jeff Thorne" <jeff_thorne@yahoo.com>
To: <java-user@lucene.apache.org>
Sent: Monday, February 06, 2006 3:56 AM
Subject: Inappropriate content detection


> I am trying to figure out whether or not Lucene is an appropriate solution
> for a problem that our site faces. Our site allows users to post their
> opinions on various topics. Due to various government legislation around
> the world, our management would like us to scan each user's post against
> various keywords that would indicate inappropriate content in the post.
> We are looking for racial slurs, profanity and attacks against sexual
> orientation. Each user's post is generally not more than a few paragraphs.
>
>
>
> I would like to analyze each user's post for various words and expressions
> before publishing it to the DB. I am reading through the Lucene in
> Action book and it looks as if I cannot analyze a string without first
> indexing it. If this is true, will indexing each post be a performance hit
> to the site? I was wondering if someone could shed some light on the best
> way to tackle this problem with Lucene, or with another API if doing so
> makes more sense.
>
>
>
> Thanks,
>
> Jeff
>
>
>
> 
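
On the question of checking a string without indexing it first: a plain
keyword check needs no index at all. Below is a minimal sketch in plain Java
with no Lucene calls. The class name KeywordScanSketch and the placeholder
blocked words are assumptions for illustration. It only matches whole tokens,
so misspellings, phrases and obfuscated words would need more work.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class, not part of Lucene or any other library.
class KeywordScanSketch {
    private final Set<String> blocked = new HashSet<>();

    // Blocked words are stored lower-cased so matching is case-insensitive.
    KeywordScanSketch(String... blockedWords) {
        for (String word : blockedWords) {
            blocked.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Returns the blocked words found in the post, in order of appearance.
    List<String> findBlocked(String post) {
        List<String> hits = new ArrayList<>();
        // Split on runs of characters that are not letters, digits or apostrophes.
        String[] tokens = post.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}']+");
        for (String token : tokens) {
            if (blocked.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

Each check is one pass over a post of a few paragraphs with a hash lookup per
token, so it can run in memory before the post is written to the DB.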


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org



