incubator-jspwiki-dev mailing list archives

From Murray Altheim <murra...@altheim.com>
Subject Re: Spam package redesign
Date Fri, 25 Sep 2009 22:17:46 GMT
Andrew Jaquith wrote:
> ** Warning: long post **
> 
> After some fooling around and some actual work, I've finished my first
> pass at refactoring on the anti-spam code. I'm proposing a new
> package, org.apache.wiki.content.inspect, which contains a
> general-purpose content-inspection capability, of which spam is just
> one potential application. Here is a draft of the package javadocs.
[...]
> I can foresee other uses for this too, for example general-purpose
> content classification. But that's for another day.
> 
> Comments, thoughts? It's going to take some time to get unit tests
> done, so I won't be committing this for a little while.

Hi Andrew,

This sounds pretty impressive, all in all. With my library hat on, my
interest was piqued by the idea of using this for non-spam applications,
so the only comment I have at this point is to wonder how you might
include the hook into Lucene.

The way I'd see this working would be as follows.

I'd not want to overload the Dublin Core Subject; rather, I'd treat the
result as a sort of informative field that might actually be used to
populate the Subject. The structure of the result of the inspection
would be a map of pseudo-subject (facet?) identifiers and a score for
each, e.g.,

   Subject:         Shipping, Shipwrecks, Transportation
   Pseudo-Subject:  Lusitania                             Score: 0.67
   Pseudo-Subject:  http://en.wikipedia.org/wiki/Titanic  Score: 0.89
   Pseudo-Subject:  Storm                                 Score: 0.56
   Pseudo-Subject:  Mermaid                               Score: 0.24

Where the "pseudo-subject" can be either a string or a URI subject
identifier. Note that "pseudo-subject" is not a term of art, and
I'd hope to come up with something more suitable. One could then use
some mathematically-sensible composite of the scores to obtain the
overall score for the document. You could even choose subsets of the
pseudo-subjects to obtain targeted scores. This would still work for
spam detection but would potentially be very powerful for subject
classification, especially if it was tied into the search functionality.
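
A minimal sketch of the scoring idea above, in Java since that is what
JSPWiki is written in. Everything here is an assumption made for
illustration: the class name PseudoSubjectScores and its methods are
hypothetical and not part of the proposed org.apache.wiki.content.inspect
package, and the arithmetic mean is only one possible stand-in for a
"mathematically-sensible composite".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of JSPWiki or the proposed package.
public class PseudoSubjectScores {

    /** The example scores from this message; keys are strings or URI subject identifiers. */
    public static Map<String, Double> exampleScores() {
        Map<String, Double> scores = new LinkedHashMap<String, Double>();
        scores.put("Lusitania", 0.67);
        scores.put("http://en.wikipedia.org/wiki/Titanic", 0.89);
        scores.put("Storm", 0.56);
        scores.put("Mermaid", 0.24);
        return scores;
    }

    /** Overall document score: arithmetic mean of all scores (an assumed composite). */
    public static double composite(Map<String, Double> scores) {
        if (scores.isEmpty()) {
            return 0.0; // nothing was inspected, so there is no evidence either way
        }
        double sum = 0.0;
        for (double s : scores.values()) {
            sum += s;
        }
        return sum / scores.size();
    }

    /** Targeted score: the same mean, restricted to the selected pseudo-subjects. */
    public static double composite(Map<String, Double> scores, Set<String> selected) {
        Map<String, Double> subset = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (selected.contains(e.getKey())) {
                subset.put(e.getKey(), e.getValue());
            }
        }
        return composite(subset);
    }
}
```

With the example scores, composite(exampleScores()) comes to about 0.59,
and restricting the selection to "Lusitania" and the Titanic URI gives
about 0.78. A weighted mean or a maximum would slot into the same shape.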

Does this make any sense?

Murray
