spamassassin-dev mailing list archives

From j.@jmason.org (Justin Mason)
Subject Re: SpamAssassin perceptron curiousity
Date Wed, 07 Sep 2005 18:57:09 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


I think Henry needs to comment on this one; he wrote that code.
Re-aiming directly at Henry ;)

- --j.

felix@crowfix.com writes:
> (Originally posted to users@ but reposted to dev now)
> 
> I got a bit of curiosity in my brain about neural networks, and
> someone suggested I take a look at how SpamAssassin trains itself.  I
> have been looking into .../masses and come across some things which
> set off warning bells.  I don't think I have actually found any bugs,
> but it isn't clear to me what is going on, there are some unused
> variables, and I pathetically justify my intrusion on your time with
> the thought that there *might* be a bug ... :-)
> 
> The code generated in tmp/scores.h by logs-to-c includes these three
> variables:
> 
>     ny_hit[$num_mutable]
>     yn_hit[$num_mutable]
>     lookup[$num_mutable]
> 
> which appear to never be used by either perceptron.c or any generated
> code.
> 
> It also looks like $num_mutable has almost no use; besides setting the
> size of these unused arrays, it governs the weight decay loop, which
> looks to be bypassed under default conditions.
> 
> A bit more poking shows that num_scores in perceptron.c, set from
> $size in logs-to-c, is used for all other array sizes, including the
> weights, and for all related loops, including scaling and printing the
> weights.  What puzzles me is the print loop at the end of write_weights():
> 
>   for (i = 0; i < num_scores; i++) {
>     if ( is_mutable[i] )  {
>       fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i],
>               weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>     } else {
>       fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
>     }
>   }
> 
> The weight decay loop operates only on the first num_mutable entries
> of the weights array, implying that it, and presumably all other
> arrays sized by num_scores, are set up with mutable scores first,
> followed by non-mutable scores.  Thus this loop could be rewritten
> like this:
> 
>   for (i = 0; i < num_scores; i++) {
>     if ( i < num_mutable )  {
>       fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i],
>               weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>     } else {
>       fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
>     }
>   }
> 
> or even like this:
> 
>   for (i = 0; i < num_mutable; i++) {
>     fprintf(fp, "score %-30s %2.3f # [%2.3f..%2.3f]\n", score_names[i],
>             weight_to_score(weights[i]), range_lo[i], range_hi[i]);
>   }
>   for (; i < num_scores; i++) {
>     fprintf(fp, "score %-30s %2.3f # not mutable\n", score_names[i], range_lo[i]);
>   }
> 
> Is this right?  I have been doing so much Perl recently that C is
> beginning to look funny, like reading Mark Twain after too much
> Charles Dickens.  Redundant variables set off alarm bells in my head.
> If these are redundant, that would be nice to know, and if not
> redundant, the code looks wrong.
> 
> What I am really trying to do is understand the neural network part of
> SpamAssassin and I seem to have gotten sidetracked, as with all fun
> projects :-)  I have gotten hung up on what mutable means for the code
> in .../masses/, and it does not seem particularly clear yet.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDHzgFMJF5cimLx9ARAmNZAKCgP8JUNEYvfA++dfBhETPLwv5cOwCfZlpz
k+Qca8K/GYcRgwFMVfGBgzI=
=XE1W
-----END PGP SIGNATURE-----
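
The loop rewrites proposed in the quoted message are equivalent to the
original only if is_mutable[i] is true for exactly the first num_mutable
entries and false for the rest. The message infers that ordering from the
weight decay loop but does not confirm it. A minimal sketch of a check for
that assumption follows; the function name mutable_scores_come_first is
hypothetical and is not taken from perceptron.c or logs-to-c, and the
parameter types are guesses rather than the types in the generated scores.h:

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of perceptron.c.
 *
 * Returns 1 if is_mutable[] is nonzero for exactly the first num_mutable
 * entries and zero for all remaining entries up to num_scores; returns 0
 * otherwise (including when num_mutable exceeds num_scores).
 */
int mutable_scores_come_first(const int *is_mutable,
                              size_t num_scores,
                              size_t num_mutable)
{
    size_t i;

    if (num_mutable > num_scores)
        return 0;

    for (i = 0; i < num_scores; i++) {
        int expected = (i < num_mutable);   /* mutable block comes first */
        int actual = (is_mutable[i] != 0);  /* normalize to 0 or 1 */

        if (actual != expected)
            return 0;
    }
    return 1;
}
```

If such a check returned 1 for the generated data, testing i < num_mutable
and testing is_mutable[i] would select the same entries in the print loop.
If it returned 0, the two variables would not be redundant and the rewrites
would change the output.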

