community-dev mailing list archives

From "Kevin A. McGrail (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (COMDEV-260) GSOC 2018 SpamAssassin Bayes Token ID
Date Mon, 05 Feb 2018 04:39:00 GMT

     [ https://issues.apache.org/jira/browse/COMDEV-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin A. McGrail updated COMDEV-260:
------------------------------------
    Description: 
From an idea by Diane F Skoll (used with permission):

We tokenize inbound messages and store the tokens on the server. In each message, we add links
for doing training. When you click on a training link, the system trains on the message using
the tokens stored on the server. That way, you are training on exactly the tokens that the
Bayes code saw.

For SA, the key point is a framework that stores the Bayesian tokens from an email before it is
delivered, so that a later "this is spam" / "this is ham" mechanism can use that information
without needing the entire email.

Adding a header that carries the message id used for storage makes it easier to build a
train-as-spam / train-as-ham framework.
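
As a rough illustration of the store-then-tag step, here is a minimal Python sketch. Everything
in it is an assumption made for the example: the header name X-Bayes-Token-Id, the in-memory
token_store, and the simplistic tokenizer are hypothetical and are not SpamAssassin's own Bayes
tokenizer or storage. It also assumes a single-part plain-text message.

```python
import hashlib
import re
import uuid
from email import message_from_string

# Hypothetical header name; not an existing SpamAssassin header.
TOKEN_ID_HEADER = "X-Bayes-Token-Id"

# In-memory stand-in for server-side storage: token id -> set of hashed tokens.
token_store: dict[str, frozenset[str]] = {}


def tokenize(text: str) -> frozenset[str]:
    """Rough tokenizer: lowercase alphanumeric runs of 3+ chars, stored as short hashes."""
    words = re.findall(r"[a-z0-9]{3,}", text.lower())
    return frozenset(hashlib.sha1(w.encode("utf-8")).hexdigest()[:10] for w in words)


def store_tokens_and_tag(raw_message: str) -> tuple[str, str]:
    """Store the message's tokens under a new id and add that id as a header.

    Returns (token_id, tagged_message). Assumes a single-part text message.
    """
    msg = message_from_string(raw_message)
    text = f"{msg.get('Subject', '')}\n{msg.get_payload()}"
    token_id = uuid.uuid4().hex
    token_store[token_id] = tokenize(text)
    msg[TOKEN_ID_HEADER] = token_id
    return token_id, msg.as_string()


raw = "From: a@example.com\nSubject: Cheap pills\n\nBuy cheap pills now\n"
token_id, tagged = store_tokens_and_tag(raw)
print(token_id in token_store, TOKEN_ID_HEADER in tagged)
```

Only the hashed tokens and the id are kept on the server in this sketch; the delivered message
carries just the id, which a later training request can hand back.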

The issues you are pointing to have more to do with the implementation of the "this is spam" /
"this is ham" mechanism.

Storing just the tokens takes less space and mitigates privacy and legal concerns.

sa-learn would then be extended to take the message id and learn the message as spam or ham,
instead of being fed the entire message.
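
To make the learn-by-id step concrete, here is a small Python sketch under stated assumptions.
It is not sa-learn's interface: learn_by_id, the stored_tokens dictionary, and the sample ids
and tokens are all hypothetical, and the per-token counters only stand in for a real Bayes
database. Undoing a previous opposite label before relearning is a design choice of this sketch.

```python
from collections import Counter

# Stand-in for the server-side store: token id -> tokens seen at scan time.
# The ids and tokens are made up for illustration.
stored_tokens = {
    "msg-001": {"cheap", "pills", "buy"},
    "msg-002": {"meeting", "agenda", "buy"},
}

spam_counts: Counter = Counter()
ham_counts: Counter = Counter()
learned: dict[str, str] = {}  # token id -> "spam" or "ham", to avoid double counting


def learn_by_id(token_id: str, label: str) -> bool:
    """Train on the stored tokens for token_id; return False if nothing changed."""
    if label not in ("spam", "ham"):
        raise ValueError("label must be 'spam' or 'ham'")
    tokens = stored_tokens.get(token_id)
    if tokens is None or learned.get(token_id) == label:
        return False
    # If the message was previously learned the other way, undo that first.
    previous = learned.get(token_id)
    if previous is not None:
        (spam_counts if previous == "spam" else ham_counts).subtract(tokens)
    (spam_counts if label == "spam" else ham_counts).update(tokens)
    learned[token_id] = label
    return True


learn_by_id("msg-001", "spam")
learn_by_id("msg-002", "ham")
print(spam_counts["buy"], ham_counts["buy"])
```

The training call needs only the id and a label, so the original email never has to be kept or
resubmitted.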

 

 

Apache SpamAssassin is a mail filter to identify spam. It is an intelligent email filter which
uses a diverse range of tests to identify unsolicited bulk email, more commonly known as Spam.
These tests are applied to email headers and content to classify email using advanced statistical
methods. 

In addition, SpamAssassin has a modular architecture that allows other technologies to be
quickly wielded against spam and is designed for easy integration into virtually any email
system. 

It is primarily written in Perl with a few bits in C and shell scripts for system integration.

The compendium at https://raptor.pccc.com/raptor.cgim?template=email_spam_compendium is helpful
for understanding some of the concepts behind SpamAssassin.

It will be helpful for a student on this project to understand SMTP, but what is most desired
is a willingness to learn and to set up your own mail server with SpamAssassin on a Linux
distribution for a personal test domain. Assistance will be provided to get the basic framework
for a learning sandbox in place.

As email becomes more commoditized by major providers, knowledge of email systems and their
security is dwindling. This opportunity can provide real-world experience with an email
security product that is employed by countless commercial systems in the world.

  was:
From DFS idea used with permission:

We tokenize inbound messages and store the tokens on the server. In each message, we add links
for doing training. When you click on a training link, the system trains the message based
on the tokens stored on the server. In that way, you are training using exactly the tokens
that the Bayes code saw. 

For SA, the key point is a framework to store the Bayesian tokens from the email before delivery
of the email so later, a "this is spam" "this is ham" mechanism can take advantage of that
information without having the entire email.

Adding a header with the message id for the storage of the headers allows a framework to be
built for train as spam, train as ham to be more readily built.

The issues you are pointing to have to deal more with the implementation of the this is spam/this
is ham mechanism.

By storing just the tokens, there is less space and privacy & legal concerns are mitigated.

sa-learn would then be extended to use the message id and learn as spam/ham instead of feeding
it the entire message.


> GSOC 2018 SpamAssassin Bayes Token ID
> -------------------------------------
>
>                 Key: COMDEV-260
>                 URL: https://issues.apache.org/jira/browse/COMDEV-260
>             Project: Community Development
>          Issue Type: Project
>            Reporter: Kevin A. McGrail
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@community.apache.org
For additional commands, e-mail: dev-help@community.apache.org

