community-commits mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (COMDEV-161) mailglomper.py may count a message multiple times
Date Fri, 25 Sep 2015 11:52:05 GMT

     [ https://issues.apache.org/jira/browse/COMDEV-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb updated COMDEV-161:
------------------------
    Description: 
The mailglomper.py script counts messages by matching /Date: (.*)/.
It is looking to match header lines of the form:

Date: Thu, 01 May 2008 05:06:51 +0000

However, such lines are not guaranteed to be unique within a message.

In particular SVN commit messages have a "Date:" line which matches, and the parsed timestamp
will be much the same as the header date. For example:

Author: cml
Date: Wed Sep 16 19:06:03 2015
New Revision: 1703436
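
A minimal sketch of the over-count, assuming the unanchored pattern quoted above is applied line by line. The sample message and the name `count_date_matches` are made up for illustration and are not taken from mailglomper.py:

```python
import re

# Pattern as quoted in this issue; how mailglomper.py applies it is an assumption.
DATE_RE = re.compile(r"Date: (.*)")

# Hypothetical mbox fragment containing a single SVN commit mail.
SAMPLE = """\
From user@example.com Wed Sep 16 19:06:10 2015
From: cml@example.com
Date: Wed, 16 Sep 2015 19:06:03 +0000
Subject: svn commit: r1703436

Author: cml
Date: Wed Sep 16 19:06:03 2015
New Revision: 1703436
"""


def count_date_matches(text):
    """Count the lines on which the Date pattern matches anywhere."""
    return sum(1 for line in text.splitlines() if DATE_RE.search(line))


# One message, but both the header line and the body line match.
print(count_date_matches(SAMPLE))
```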

The mailbox format currently used by the ASF guarantees that each message is prefixed with
a line in the format:

From user@example.com Thu May 01 05:10:32 2008

[Lines in the message body starting "From " are prefixed as ">From "; the prefix is removed
when messages are extracted]

Only lines starting "From " are guaranteed not to occur in message bodies.

The problem is trivial to fix, but it will change the generated statistics, particularly for
mailboxes that receive SVN commit messages (Git commits use a different prefix for the timestamp).
SVN mails will generally be counted twice.
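
The fix could be sketched as follows, assuming the mailbox is read line by line and that body lines starting "From " are escaped as described above. The name `count_messages` and the sample data are illustrative only, not the actual mailglomper.py code:

```python
def count_messages(lines):
    """Count mbox messages by their "From " separator lines."""
    return sum(1 for line in lines if line.startswith("From "))


# Hypothetical two-message mailbox. The first body carries an SVN-style
# "Date:" line and the second body contains an escaped ">From " line;
# neither is counted as a message.
MBOX = [
    "From user@example.com Thu May 01 05:10:32 2008",
    "Date: Thu, 01 May 2008 05:06:51 +0000",
    "",
    "Author: cml",
    "Date: Wed Sep 16 19:06:03 2015",
    "From user2@example.com Thu May 01 06:00:00 2008",
    "Date: Thu, 01 May 2008 05:59:58 +0000",
    "",
    ">From here on, the body continues.",
]

print(count_messages(MBOX))
```

Counting separator lines rather than "Date:" lines avoids depending on what a message body happens to contain.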

  was:
The mailglomper.py script counts messages by matching /Date: (.*)/.
It is looking to match header lines of the form:

Date: Thu, 01 May 2008 05:06:51 +0000

However such lines are not guaranteed to be unique within a message.

In particular SVN commit messages have a "Date:" line which matches, and the parsed timestamp
will be much the same as the header date. For example:

Author: cml
Date: Wed Sep 16 19:06:03 2015
New Revision: 1703436

Furthermore, the RE does not anchor the match at the start of a line; this allows further
Date: entries to match.

The mailbox format currently used by the ASF guarantees that each message is prefixed with
a line in the format:

From user@example.com Thu May 01 05:10:32 2008

[Lines in the message body starting "From " are prefixed as ">From "; the prefix is removed
when messages are extracted]

Only lines starting "From " are guaranteed not to occur in message bodies.

The problem is trivial to fix, but it will change the generated statistics, particularly for
mailboxes that receive SVN commit messages (Git commits use a different prefix for the timestamp).
SVN mails will generally be counted twice.


> mailglomper.py may count a message multiple times
> -------------------------------------------------
>
>                 Key: COMDEV-161
>                 URL: https://issues.apache.org/jira/browse/COMDEV-161
>             Project: Community Development
>          Issue Type: Bug
>          Components: Reporter Tool
>            Reporter: Sebb
>
> The mailglomper.py script counts messages by matching /Date: (.*)/.
> It is looking to match header lines of the form:
> Date: Thu, 01 May 2008 05:06:51 +0000
> However such lines are not guaranteed to be unique within a message.
> In particular SVN commit messages have a "Date:" line which matches, and the parsed timestamp
> will be much the same as the header date. For example:
> Author: cml
> Date: Wed Sep 16 19:06:03 2015
> New Revision: 1703436
> The mailbox format currently used by the ASF guarantees that each message is prefixed
> with a line in the format:
> From user@example.com Thu May 01 05:10:32 2008
> [Lines in the message body starting "From " are prefixed as ">From "; the prefix is
> removed when messages are extracted]
> Only lines starting "From " are guaranteed not to occur in message bodies.
> The problem is trivial to fix, but it will change the generated statistics, particularly
> for mailboxes that receive SVN commit messages (Git commits use a different prefix for the
> timestamp). SVN mails will generally be counted twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
