spamassassin-users mailing list archives

From Shawn Bakhtiar
Subject Re: My new method for blocking spam - REVEALED!
Date Wed, 20 Jan 2016 23:48:43 GMT
Set theory is not my strongest suit, but your diagram looks incorrect:


Let H be ham
Let S be spam
Let E be an email

Then you state that:
HE = (H ∩ E)
SE = (S ∩ E)

But then the next diagram shows that there is some solution in which (HE ∩ SE) is non-empty, and thus there
may be some set which is (HE \ SE), even though in the first diagram S and H do not intersect.

This is not logical. Either (H ∩ S) exists, in which case there are tokens common to the ham and spam token
sets, or it does not, so which is it? In other words, if a token is both ham and spam, how
are you calculating its weight? Is it spam or ham?

Clearly it’s the latter (they do not intersect) as described in this:

In which case you are simply looking to see if |H ∩ E| > |S ∩ E|, which has nothing to do
with what is not in the set. There is indeed no (H ∩ S), nor the negation or NOT, which is
(H \ S), so as everyone has been trying to explain, it has NOTHING to do with what is NOT matched.

By the way, you can’t match an infinite set (well, theoretically, but not actually).

Since the current Bayes learns both SPAM and HAM, I imagine that it does a very similar thing,
other than perhaps the larger multi-word token sets, which seem a trivial thing to add and
are available in other tool sets.

On Jan 20, 2016, at 10:52 AM, Marc Perkel wrote:

Yes - you missed something. It is about intersecting one corpus and NOT intersecting the other.

This is about what doesn't match - not what does.

On 01/20/16 10:26, Shawn Bakhtiar wrote:
Sorry.. how is this different than Naive Bayes filtering??

"Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes
other things), with spam and non-spam e-mails and then using Bayes' theorem to calculate a
probability that an email is or is not spam."

"the set of fingerprints of the test message is intersected with the spam and ham corpora creating
sub sets of matches. Then you do a set diff both ways (ham - spam) (spam - ham) and whichever
side is bigger wins. Generally it will match on only one side or very predominately on one
side.” — Marc Perkel

You are still looking up words/phrases in a dictionary set, and coming up with a probability
factor of which side it falls on (an application of Bayes' theorem).

Or did I miss something?

On Jan 20, 2016, at 9:17 AM, Wrolf wrote:

Good luck with your patent application, it should be in the infinitely elastic queue right
after my perpetual motion machine.

Not sure how you will deal with the number of ham tokens in spam messages. Also not sure how
much ham will get canned as spam - but then, maybe people shouldn't be sending each other

haiku by email
blossoms in my inbox
drink morning coffee



On Wed, Jan 20, 2016 at 11:52 AM, Marc Perkel wrote:
OK - following up on this. I have my provisional patent filed. I'm still doing development
to improve it and working on a licensing contract. But the license will be based on the Creative
Commons patent with some restrictions added. Basically I want to get a license fee from the
big guys and my spam filtering competitors. So unless you are in the spam filtering business
or have more than 10,000 email addresses it's not going to cost you anything.

I'm going to describe the concept here. I'm not going to share my code because my code is
specific to my system and it is a combination of bash scripts, Redis, Pascal, PHP, and Exim rules.
And the open source programmers are likely to implement it better than I have. Basically I'm
trying not to put myself out of business and this new method is a bigger breakthrough than
Bayesian filtering.

Maybe I should call it a new plan for spam?

So - I'm just going to introduce the concept right now about how it works. Once you know what
I'm doing it should be easy to implement, I had it working in a couple of days and I'm not
an outstanding programmer. One thing to keep in mind is this is a paradigm shift. It's not
about matching - it's about NOT matching. And although it is far better at catching spam,
its best feature is actively identifying good email.

The secret sauce

Suppose I get an email with the subject line "Let's get some lunch". I know it's a good email
because spammers never say "Let's get some lunch". In fact there are an infinite number of words
and phrases that are used in good email that are never ever used in spam. And if I'm using
words and phrases never used in spam that are used in ham - it's good email. And similarly
- if I'm using words and phrases that are used in spam and never used in ham - it's spam.

So - how do I get a list of words and phrases never used in spam? I create a list of words
and phrases that are used in spam and check whether a given word or phrase is not on that list.

What I do is tokenize the spammiest parts of the email, like the subject line, into words and
phrases of 1, 2, 3, and 4 words.

the quick brown fox jumps over the lazy dog - becomes

"the" "quick" "the quick" "brown" "quick brown" "the quick brown" "fox" "brown fox" "quick
brown fox" "the quick brown fox" "jumps" "fox jumps" "brown fox jumps" "quick brown fox jumps"
"over" "jumps over" "fox jumps over" "brown fox jumps over" "the" "over the" "jumps over the"
"fox jumps over the" "lazy" "the lazy" "over the lazy" "jumps over the lazy" "dog" "lazy dog"
"the lazy dog" "over the lazy dog"
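
The tokenization above can be sketched in a few lines of Python. This is only an illustration, not the original implementation (which is described as bash, Redis, Pascal, PHP, and Exim rules). The function name `ngram_tokens`, the `max_words` parameter, lowercasing, and splitting on whitespace are all assumptions.

```python
def ngram_tokens(text, max_words=4):
    """Return every phrase of 1..max_words consecutive words, in the same
    order as the example list: for each word, the phrases ending at it."""
    # Assumption: lowercase and split on whitespace; the post does not say
    # how punctuation or case are handled.
    words = text.lower().split()
    tokens = []
    for end in range(1, len(words) + 1):
        for size in range(1, max_words + 1):
            start = end - size
            if start < 0:
                break  # not enough preceding words for a phrase this long
            tokens.append(" ".join(words[start:end]))
    return tokens


tokens = ngram_tokens("the quick brown fox jumps over the lazy dog")
print(len(tokens))       # 30 phrases, matching the list above
print(len(set(tokens)))  # 29 distinct, because "the" occurs twice
```

Storing the tokens in a set collapses the duplicate "the", so the nine-word example contributes 29 distinct entries.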

These tokens are learned as ham or spam and added to sets. I'm using Redis to do this because
it has extremely fast set operations. I don't know of anything other than Redis that can do
this. So think about Redis as the way to implement this.
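
For illustration only, the learning step can be sketched with Python's built-in `set` type standing in for Redis. The class name `TokenCorpus`, its `learn` method, and the two labels are hypothetical and not taken from the original system. In Redis the analogous command for adding members to a set would be SADD.

```python
class TokenCorpus:
    """In-memory stand-in for the two learned token sets (ham and spam)."""

    def __init__(self):
        # One set per class; a token may end up in both if it is seen
        # in both ham and spam messages.
        self.sets = {"ham": set(), "spam": set()}

    def learn(self, label, tokens):
        """Add the tokens of one message to the 'ham' or 'spam' set."""
        self.sets[label].update(tokens)  # raises KeyError for unknown labels


corpus = TokenCorpus()
corpus.learn("ham", ["lunch", "some lunch", "get some lunch"])
corpus.learn("spam", ["cheap", "pills", "cheap pills"])
print(sorted(corpus.sets["spam"]))
```

An in-memory set is enough to try the idea on a small corpus. A shared store such as Redis becomes relevant when several mail processes need to see the same sets.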

A new message comes in. It is tokenized and fingerprinted and hundreds of fingerprints are
generated. Then it's all set operations. The set of fingerprints of the test message is intersected
with the spam and ham corpora, creating subsets of matches. Then you do a set diff both ways
(ham - spam) (spam - ham) and whichever side is bigger wins. Generally it will match on only
one side or very predominantly on one side.
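
A minimal sketch of those set operations, using plain Python sets rather than Redis (where the analogous commands would be SINTER and SDIFF). The function name `classify`, the sample corpora, and the "unsure" result for a tie are assumptions. The post does not say what happens when both sides are equal.

```python
def classify(message_tokens, ham_corpus, spam_corpus):
    """Intersect the message with each corpus, diff both ways, and let
    the bigger side win."""
    fingerprints = set(message_tokens)
    ham_matches = fingerprints & ham_corpus    # message tokens seen in ham
    spam_matches = fingerprints & spam_corpus  # message tokens seen in spam
    ham_only = ham_matches - spam_matches      # (ham - spam)
    spam_only = spam_matches - ham_matches     # (spam - ham)
    if len(ham_only) > len(spam_only):
        return "ham"
    if len(spam_only) > len(ham_only):
        return "spam"
    return "unsure"  # assumption: ties and no-match cases are left undecided


# Hypothetical toy corpora; "get" has been seen in both ham and spam.
ham_corpus = {"get", "some", "lunch", "get some lunch"}
spam_corpus = {"get", "cheap", "pills", "cheap pills"}

print(classify(["get", "some", "lunch", "get some lunch"], ham_corpus, spam_corpus))
```

In this toy run the token "get" matches both corpora, so it drops out of both differences. Three ham-only matches against zero spam-only matches gives "ham".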

So I'm not just tokenizing the subject. I also tokenize the first 25 words of the message, the text of
links in the message, the name part of the From address, the header names, the attachment
names, the PHP script if there is one, and various behavior characteristics (slow, no quit,
no RDNS, number of MIME parts, multiple recipients, etc.).

SpamAssassin is all about matching rules. This is all about not matching. Not matching allows
you to compare to an infinite set rather than a finite set. So when spammers start misspelling
words to not match the rules, my system catches that and makes its own rules. The tricks that
spammers use now make it easier to catch them using this method.

I will post a link to a better explanation later when I write one. But wanted to let you all
know this wasn't just a tease from some crazy person.

So - here's what I want to see happen.

I'd like to see SA implement this. I will provide a license to include with it giving most
people a free license. Sort of like how Spamhaus isn't free to everyone, but it's in SA. Then
the new method will take off and eventually I'll get a little something for this.

This new method (I'm calling it the Evolution Spam Filter because the algorithm mimics evolution)
doesn't just block spammers, it decimates spammers. It's not just a treatment - it's the
cure. I hate spam and although I could have kept this secret and made money having the best
spam filter on the planet, I decided I had a moral obligation to make this generally available.
I think this will save the global economy billions of dollars in recovered productivity and
crime and fraud prevention.

I'm seeing close to 100% accuracy. It is so accurate it's scary and I think my implementation
is crude at best. I think if it were done right it could even get closer to 100% than I have.
Once you wrap your brain around the concept it's almost scary how well it works.

The side effect is that this is a very fast and simple recursive learner. What happens is that
as people converse by email it learns more words and phrases about the stuff that people talk
about that are never used in spam. It doesn't have to know what language you are using; it
will learn it on its own. It's like having SA with 100 million accurate rules where it writes
new rules itself.

I will leave you with that and I'll have more later.

Marc Perkel - Sales/Support
Junk Email Filter dot com
