incubator-jspwiki-dev mailing list archives

From Andrew Jaquith <andrew.r.jaqu...@gmail.com>
Subject Spam package redesign
Date Fri, 25 Sep 2009 17:54:55 GMT
** Warning: long post **

After some fooling around and some actual work, I've finished my first
pass at refactoring the anti-spam code. I'm proposing a new
package, org.apache.wiki.content.inspect, which contains a
general-purpose content-inspection capability, of which spam is just
one potential application. Here is a draft of the package javadocs.

The "inspect" package provides facilities that allow content to be
inspected and scored based on various criteria such as whether a wiki
page change contains spam. Content may be scored by initializing an
Inspection object and then calling the {@link
Inspection#inspect(String,Change)} method. The {@code inspect} method,
in turn, iterates through a variety of previously-configured {@link
Inspector} objects and calls the {@link
Inspector#inspect(Inspection,String,Change)} method for each one. Each
of the configured inspectors is free to perform whatever evaluations
it wishes, and can increase, decrease or reset the "scores" for any
score category, for example the {@link Score#SPAM} category.

Callers can also add "limits" (thresholds) for categories to the
Inspection, so that inspection stops when one of these limits is
reached. For example, a caller can specify that the Inspection should
stop immediately when Score.SPAM reaches 3. An Inspector can
arbitrarily halt processing by throwing an {@link
InspectionInterruptedException}. The choice of Inspectors to use for
the Inspection, the types of Scores that are modified, and what limits
are placed on scores, are completely configurable.
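
To make the control flow above concrete, here is a minimal, self-contained sketch of the loop: inspectors run in order, each may change a score, and processing stops when a limit is reached or an inspector halts it. Every type here (Category, HaltException, MiniInspector, MiniInspection) is a simplified stand-in invented for illustration, not one of the real classes in the proposed package. The sketch also assumes a category fails once its score reaches the limit.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; NOT the real
// org.apache.wiki.content.inspect classes.
enum Category { SPAM }

// Stand-in for the exception an inspector throws to halt processing.
class HaltException extends Exception {
    HaltException(String message) { super(message); }
}

interface MiniInspector {
    void inspect(MiniInspection inspection, String content) throws HaltException;
}

class MiniInspection {
    private final MiniInspector[] inspectors;
    private final Map<Category, Integer> scores = new EnumMap<>(Category.class);
    private final Map<Category, Integer> limits = new EnumMap<>(Category.class);
    int inspectorsRun = 0; // how many inspectors were invoked

    MiniInspection(MiniInspector... inspectors) { this.inspectors = inspectors; }

    void addLimit(Category category, int limit) { limits.put(category, limit); }

    // Adds delta (which may be negative) to the running total for a category.
    void changeScore(Category category, int delta) {
        scores.merge(category, delta, Integer::sum);
    }

    // Scores start at 0 until something changes them.
    int getScore(Category category) { return scores.getOrDefault(category, 0); }

    // Assumption: a category fails once its score reaches its limit.
    boolean isFailed() {
        for (Map.Entry<Category, Integer> limit : limits.entrySet()) {
            if (getScore(limit.getKey()) >= limit.getValue()) return true;
        }
        return false;
    }

    void inspect(String content) {
        for (MiniInspector inspector : inspectors) {
            inspectorsRun++;
            try {
                inspector.inspect(this, content);
            } catch (HaltException e) {
                return; // an inspector asked to stop processing
            }
            if (isFailed()) return; // a limit was reached; skip the rest
        }
    }
}

class InspectionLoopDemo {
    public static void main(String[] args) {
        MiniInspector addTwo = (insp, text) -> insp.changeScore(Category.SPAM, 2);
        MiniInspection inspection = new MiniInspection(addTwo, addTwo, addTwo);
        inspection.addLimit(Category.SPAM, 3);
        inspection.inspect("some page text");
        // The second inspector pushes the score past the limit,
        // so the third inspector never runs.
        System.out.println(inspection.getScore(Category.SPAM)
            + " failed=" + inspection.isFailed()
            + " run=" + inspection.inspectorsRun);
    }
}
```

Checking the limit after each inspector, rather than once at the end, is what lets a caller skip expensive inspectors once the outcome is already decided.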

How to use the classes in this package

The key classes in this package are: Inspection, InspectionContext,
Score, and Inspector.

* The Inspection is the core class in this package. It implements the
Gang-of-Four "strategy" pattern and is itself quite lightweight. It is
initialized by constructing a new Inspection object and supplying the
current WikiContext, InspectionContext, and an array of Inspector
objects that conduct the inspection. The Inspection maintains an
internal running integer total for each Score category, for example
{@link Score#SPAM}. A Score category can be incremented, decremented,
or reset to zero by calling {@link Inspection#changeScore(Score)}. The
integer value for a Score can be retrieved at any time by calling {@link
Inspection#getScore(Score.Type)}. The initial value is always 0. If a
Score category exceeds its threshold, {@link Inspection#isFailed()}
will return true.

* InspectionContext keeps references to the WikiEngine and other
shared-state objects needed by Inspections and Inspectors. The
InspectionContext persists between HTTP requests and keeps a reference
to the {@link BanList}. It also tracks the IP addresses of hosts that
have modified content recently, along with the changes they have made.
Callers (such as Inspectors) can add a host to the list of recent
modifications by calling
InspectionContext.addModifier(HttpServletRequest,Change). The
InspectionContext is normally initialized just once, as part of the
WikiEngine startup.

* Score objects supply instructions to the parent Inspection object to
increment, decrement or reset the score for a particular category.
Each Score object is constructed with a category (for example,
Score.SPAM), an integer indicating how much to change the score, and
an optional String message that provides context for the change. For
example, a Score that increments the spam score by 1 could be
constructed by new Score( Score.SPAM, 1, "Bot detected." ). Negative
numbers can be supplied also to decrease the score. For convenience,
{@link Score#INCREMENT_SCORE} means "add 1", {@link
Score#DECREMENT_SCORE} means "subtract 1", and {@link Score#RESET}
means "reset to zero."

* The Inspector interface specifies how to implement a particular
strategy for inspecting content. The Inspector has just one primary
method: {@link Inspector#inspect(Inspection, String, Change)}.
Inspectors can do just about anything. The Inspection parameter can be
used to obtain the HttpServletRequest and WikiContext. Scores can be
changed by calling {@link Inspection#changeScore(Score)}. Inspectors
that need to terminate processing can throw an {@link
InspectionInterruptedException}. Inspectors are intended to be
instantiated just once, and are initialized via the method {@link
Inspector#initialize(InspectionContext)}. The InspectionContext
parameter can be used to obtain references to the BanList,
WikiEngine, and WikiEngine Properties. Inspectors are meant to be
re-used and should not keep state between invocations of inspect.
Examples of Inspector implementations include the ChangeRateInspector
(which determines whether the current user has made too many recent
changes) and the LinkCountInspector (which counts links).
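
As a rough illustration of the Inspector contract described above (configure once in initialize, keep no state between calls to inspect, report findings as score changes), here is a self-contained sketch of a link-counting inspector. All types (ScoreChange, ScoreSink, SketchInspector, LinkCountSketch, RunningTotals) and the property name "sketch.maxLinks" are hypothetical stand-ins. The real LinkCountInspector may work quite differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a Score: category, delta, and optional message.
final class ScoreChange {
    final String category;
    final int delta;
    final String message;

    ScoreChange(String category, int delta, String message) {
        this.category = category;
        this.delta = delta;
        this.message = message;
    }
}

// Hypothetical stand-in for the part of Inspection that an inspector uses.
interface ScoreSink {
    void changeScore(ScoreChange change);
}

// Hypothetical stand-in for the Inspector interface.
interface SketchInspector {
    void initialize(Properties properties);
    void inspect(ScoreSink inspection, String content);
}

final class LinkCountSketch implements SketchInspector {
    private static final Pattern LINK = Pattern.compile("https?://\\S+");
    private int maxLinks = 2; // set in initialize, only read during inspect

    @Override
    public void initialize(Properties properties) {
        // "sketch.maxLinks" is an invented property name for this example.
        maxLinks = Integer.parseInt(properties.getProperty("sketch.maxLinks", "2"));
    }

    @Override
    public void inspect(ScoreSink inspection, String content) {
        // Only local variables are used here, so the instance can be reused.
        Matcher matcher = LINK.matcher(content);
        int links = 0;
        while (matcher.find()) links++;
        if (links > maxLinks) {
            inspection.changeScore(new ScoreChange("SPAM", 1, "Too many links: " + links));
        }
    }
}

// Keeps a running integer total per category, starting at 0.
final class RunningTotals implements ScoreSink {
    private final Map<String, Integer> totals = new HashMap<>();

    @Override
    public void changeScore(ScoreChange change) {
        totals.merge(change.category, change.delta, Integer::sum);
    }

    int get(String category) { return totals.getOrDefault(category, 0); }
}

class LinkCountDemo {
    public static void main(String[] args) {
        LinkCountSketch inspector = new LinkCountSketch();
        inspector.initialize(new Properties()); // falls back to the default of 2
        RunningTotals totals = new RunningTotals();
        inspector.inspect(totals, "http://a.example http://b.example http://c.example");
        System.out.println("SPAM score: " + totals.get("SPAM"));
    }
}
```

Because the configured threshold is only read during inspect, the same instance can be shared across many inspections, which is the reuse property the Inspector description asks for.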

Here is an example of how to create and execute an Inspection, modeled
after how SpamFilter does it. SpamFilter's initialize method
constructs the InspectionContext, which will be shared by multiple
Inspections. Notice how the Inspector objects are instantiated and
initialized at this time also.

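private InspectionContext m_config = null;

private Inspector[] m_inspectors = null;

public void initialize( WikiEngine engine, Properties properties )
{
  ...
  // Shared context, created once and reused by every Inspection
  m_config = new InspectionContext( engine, properties );
  ...
  // Instantiate each Inspector once and initialize it with the shared context
  m_inspectors = new Inspector[] { new UserInspector(), new BanListInspector(), ... };
  for( Inspector inspector : m_inspectors )
  {
    inspector.initialize( m_config );
  }
}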

Later, when SpamFilter's preSave method executes, a new lightweight
Inspection object is created. The WikiContext, String and Change
parameters supply all the request, content and change information,
respectively, for conducting the inspection:


public String preSave( WikiContext context, String content ) throws RedirectException
{
  Change change = getChange( context, content );

  // Run the Inspection
  Inspection inspection = new Inspection( context, m_config, m_inspectors );
  inspection.addLimit( Score.SPAM, m_scoreLimit );
  inspection.inspect( content, change );
  int spamScore = inspection.getScore( Score.SPAM );
  context.setVariable( ATTR_SPAMFILTER_SCORE, spamScore );

  // Redirect user if score too high
  if( inspection.isFailed() )
  {
    ...
  }
  ...
}

Finally, notice also how the inspection.addLimit method is used to set
an upper limit for spam scoring. If the limit is reached, the
inspection.isFailed() condition will hold, and the redirection code
(unspecified in this example) will execute.

----

So, that's how it's shaping up so far. I'm quite happy with the
design, although I haven't written unit tests to shake out the bugs.
All of the existing spam-checking logic has been re-factored into the
Inspector classes, which include: BanListInspector, BotTrapInspector,
ChangeRateInspector, LinkCountInspector, SpamInspector (for Akismet),
and UserInspector. I also plan to write a CaptchaInspector for
validating Captcha responses. Additional Inspectors are easy to
create and can be plugged into the array of Inspectors passed to an
Inspection.

I can foresee other uses for this too, for example general-purpose
content classification. But that's for another day.

Comments, thoughts? It's going to take some time to get unit tests
done, so I won't be committing this for a little while.

Andrew
