spamassassin-dev mailing list archives

From bugzilla-dae...@bugzilla.spamassassin.org
Subject [Bug 5701] Enhancing SpamAssassin Anti-Phishing Detection Capability
Date Sat, 01 Dec 2007 13:02:15 GMT
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5701





------- Additional Comments From jm@jmason.org  2007-12-01 05:02 -------
(In reply to comment #4)
> URI detection in plain text is nicely implemented in 
> _get_parsed_uri_list and there are a couple of tests where 
> the functionality is needed for testing anchor text but 
> implemented in a more ad-hoc way.
> 
> One of the PILFER tests is whether the plain text URIs in 
> the anchor text match the target URI.
> 
> It seems like it is possible to implement this functionality
> as a separate function under Utils without losing too much 
> from performance. Am I missing something?

hi Umut --

Take a look at PerMsgStatus::get_uri_detail_list(); that should be very
helpful.  Here's the POD doc:

$status->get_uri_detail_list ()

Returns a hash reference of all unique URIs found in the message and
various data about where the URIs were found in the message.  It takes a
combination of the URIs found in the rendered (decoded and HTML stripped)
body and the URIs found when parsing the HTML in the message.  Will also
set $status->{uri_detail_list} (the hash reference as returned by this
function).  This function will also set $status->{uri_domain_count} (count of
unique domains).

The hash format looks something like this:

  raw_uri => {
    types => { a => 1, img => 1, parsed => 1 },
    cleaned => [ canonified_uri ],
    anchor_text => [ "click here", "no click here" ],
    domains => { domain1 => 1, domain2 => 1 },
  }

C<raw_uri> is whatever the URI was in the message itself
(http://spamassassin.apache%2Eorg/).

C<types> is a hash of the HTML tags (lowercase) which referenced
the raw_uri.  I<parsed> is a faked type which specifies that the
raw_uri was seen in the rendered text.

C<cleaned> is an array of the raw and canonified version of the raw_uri
(http://spamassassin.apache%2Eorg/, http://spamassassin.apache.org/).

C<anchor_text> is an array of the anchor text (text between <a> and
</a>), if any, which linked to the URI.

C<domains> is a hash of the domains found in the canonified URIs.


...so the anchor text for each link can be easily found that way.  Does that help?
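
A minimal sketch of how the PILFER-style check (do the URLs shown in the anchor text match the link target?) might sit on top of that structure. It assumes the hash has the shape described in the POD above, and that the keys of "domains" are registered domains such as apache.org. The sample data, the helper name mismatched_anchors and the crude URL regex are all hypothetical and for illustration only, not part of SpamAssassin; in a plugin the hash would come from $status->get_uri_detail_list() rather than being built by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-built stand-in for the hash reference described in the POD.
# In a plugin this would instead be:
#   my $detail = $status->get_uri_detail_list();
my $detail = {
  'http://spamassassin.apache%2Eorg/' => {
    types       => { a => 1, parsed => 1 },
    cleaned     => [ 'http://spamassassin.apache%2Eorg/',
                     'http://spamassassin.apache.org/' ],
    anchor_text => [ 'click here', 'http://spamassassin.apache.org/' ],
    domains     => { 'apache.org' => 1 },
  },
  'http://phish.example.net/login' => {
    types       => { a => 1 },
    cleaned     => [ 'http://phish.example.net/login' ],
    anchor_text => [ 'http://www.example.com/' ],
    domains     => { 'example.net' => 1 },
  },
};

# Hypothetical helper: true if $host equals, or is a subdomain of,
# one of the given domains.
sub host_in_domains {
  my ($host, @domains) = @_;
  return scalar grep { $host eq $_ || $host =~ /\.\Q$_\E\z/ } @domains;
}

# Hypothetical helper: return the raw URIs whose anchor text displays a
# URL with a host outside the domains recorded for the link target.
sub mismatched_anchors {
  my ($detail) = @_;
  my @suspect;
  for my $raw_uri (sort keys %$detail) {
    my $info    = $detail->{$raw_uri};
    my @domains = keys %{ $info->{domains} || {} };
    for my $text (@{ $info->{anchor_text} || [] }) {
      # Crude URL match: collect the host part of every http(s) URL
      # that appears in the anchor text.
      my @hosts = map { lc } $text =~ m{https?://([^/\s:]+)}gi;
      if (grep { !host_in_domains($_, @domains) } @hosts) {
        push @suspect, $raw_uri;
        last;    # one mismatch is enough for this URI
      }
    }
  }
  return @suspect;
}

print "anchor text disagrees with target: $_\n"
  for mismatched_anchors($detail);
```

With the sample data, only the second entry is reported, since its anchor text shows www.example.com while the link points at example.net. Anchor text with no URL in it ("click here") is ignored by this sketch.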



