community-commits mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Created] (COMDEV-161) mailglomper.py may count a message multiple times
Date Fri, 25 Sep 2015 10:57:04 GMT
Sebb created COMDEV-161:
---------------------------

             Summary: mailglomper.py may count a message multiple times
                 Key: COMDEV-161
                 URL: https://issues.apache.org/jira/browse/COMDEV-161
             Project: Community Development
          Issue Type: Bug
          Components: Reporter Tool
            Reporter: Sebb


The mailglomper.py script counts messages by matching /Date: (.*)/.
It is looking to match header lines of the form:

Date: Thu, 01 May 2008 05:06:51 +0000

However, such lines are not guaranteed to be unique within a message.

In particular SVN commit messages have a "Date:" line which matches, and the parsed timestamp
will be much the same as the header date. For example:

Author: cml
Date: Wed Sep 16 19:06:03 2015
New Revision: 1703436

Furthermore, the RE does not anchor the match at the start of a line, so any line that
merely contains "Date: " will also match.
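
As a rough illustration of the overcounting, here is a minimal sketch. It assumes the
script applies the pattern line by line with re.search, which is a guess about
mailglomper.py rather than a quote from it. The sample message, its Subject line, the
X-Original-Date header and the count_matches helper are all hypothetical.

```python
import re

# Unanchored pattern as described in this report. How mailglomper.py applies it
# (line by line, via re.search) is an assumption made for illustration only.
UNANCHORED = re.compile(r"Date: (.*)")
# Anchored variant: matches only when "Date: " starts the line.
ANCHORED = re.compile(r"^Date: (.*)")

# Hypothetical SVN commit mail built from the examples in this report.
# The Subject and X-Original-Date headers are invented for the demonstration.
SAMPLE_MESSAGE = """\
From user@example.com Wed Sep 16 19:06:10 2015
Date: Wed, 16 Sep 2015 19:06:03 +0000
Subject: svn commit: r1703436
X-Original-Date: Wed, 16 Sep 2015 19:06:03 +0000

Author: cml
Date: Wed Sep 16 19:06:03 2015
New Revision: 1703436
"""


def count_matches(pattern, text):
    """Count the lines of text on which pattern matches."""
    return sum(1 for line in text.splitlines() if pattern.search(line))


unanchored_count = count_matches(UNANCHORED, SAMPLE_MESSAGE)
anchored_count = count_matches(ANCHORED, SAMPLE_MESSAGE)
print(unanchored_count, anchored_count)
```

In this sample the anchored pattern still matches the "Date:" line in the commit body,
so anchoring alone would not make the count correct.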

The mailbox format currently used by the ASF guarantees that each message is prefixed with
a line in the format:

From user@example.com Thu May 01 05:10:32 2008

[Lines in the message body starting "From " are prefixed as ">From "; the prefix is removed
when messages are extracted]

Only lines starting "From " are guaranteed not to occur in message bodies.
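
One possible shape for the fix is to count the separator lines instead of the "Date:"
lines. The following is only a sketch under the mailbox format described above; the
count_messages helper and the sample data are hypothetical and are not taken from
mailglomper.py.

```python
def count_messages(lines):
    """Count mbox messages by their "From " separator lines.

    Assumes body lines starting with "From " have already been escaped
    as ">From " by the mailbox writer, as described in this report.
    """
    return sum(1 for line in lines if line.startswith("From "))


# Hypothetical mailbox holding two messages; the first is an SVN commit mail
# whose body contains both a "Date:" line and an escaped ">From " line.
SAMPLE_MBOX = [
    "From user@example.com Thu May 01 05:10:32 2008",
    "Date: Thu, 01 May 2008 05:06:51 +0000",
    "",
    "Author: cml",
    "Date: Wed Sep 16 19:06:03 2015",
    "New Revision: 1703436",
    ">From here on the body continues.",
    "",
    "From user@example.com Thu May 01 05:12:00 2008",
    "Date: Thu, 01 May 2008 05:11:30 +0000",
    "",
    "A plain message body.",
]

message_count = count_messages(SAMPLE_MBOX)
print(message_count)
```

Neither the body "Date:" line nor the escaped ">From " line affects the count here,
because only the separator lines are examined.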

The problem is trivial to fix, but it will change the generated statistics, particularly for
mailboxes that receive SVN commit messages (Git commits use a different prefix for the timestamp).
At present, SVN commit mails are generally counted twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
