community-commits mailing list archives

From "Sebb (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (COMDEV-163) mailglomper.py takes ages to run
Date Tue, 03 Nov 2015 14:30:27 GMT

     [ https://issues.apache.org/jira/browse/COMDEV-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb resolved COMDEV-163.
-------------------------
    Resolution: Fixed

Fixed.

The new script (scripts/mailglomper2.py) creates a JSON file which caches the weekly stats
for each mailbox. It saves each mailbox's Last-Modified time, so it only needs to reprocess
mailboxes that have changed since the previous run. Stale entries are dropped from the cache.
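
For illustration only, the caching scheme described above could be sketched roughly as
follows. This is not the code in mailglomper2.py: the function names, the cache layout and
the parse_mailbox callable are assumptions made for the example, and the download step is
left out.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cached dict, or an empty one if the file does not exist yet."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def refresh_cache(cache, listing, parse_mailbox):
    """Build an updated cache (hypothetical layout, not the real script's).

    cache:         {mailbox: {"last_modified": str, "weekly": {week_start: count}}}
    listing:       {mailbox: last_modified} for the mailboxes still of interest
    parse_mailbox: callable(mailbox) -> {week_start: count}; the expensive step
    """
    fresh = {}
    reprocessed = []
    for mailbox, last_modified in listing.items():
        entry = cache.get(mailbox)
        if entry is not None and entry["last_modified"] == last_modified:
            fresh[mailbox] = entry  # unchanged since the last run: reuse
        else:
            fresh[mailbox] = {
                "last_modified": last_modified,
                "weekly": parse_mailbox(mailbox),
            }
            reprocessed.append(mailbox)
    # Mailboxes absent from the listing are not copied over, so stale
    # entries drop out of the cache.
    return fresh, reprocessed


def save_cache(path, cache):
    """Write the cache back to disk as JSON."""
    Path(path).write_text(json.dumps(cache, indent=1, sort_keys=True))
```

With a layout like this, only the current month's mailbox would normally reach
parse_mailbox on a given run, since the earlier mailboxes no longer change.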

Runtime is now measured in minutes rather than hours.

> mailglomper.py takes ages to run
> --------------------------------
>
>                 Key: COMDEV-163
>                 URL: https://issues.apache.org/jira/browse/COMDEV-163
>             Project: Community Development
>          Issue Type: Bug
>          Components: Reporter Tool
>            Reporter: Sebb
>
> mailglomper takes a very long time to run (several hours).
> This is mainly because it has to download the last 7 mailboxes for each mailing list;
> some of these mailboxes can be quite large.
> Most of this is wasted processing because only the mailbox for the current month is ever
> updated; once a new month starts, emails are added to the new mailbox only and the earlier
> mailboxes are not updated further.
> It would be more efficient to cache the counts/times for the previous months and use
> those instead of re-reading them. If the cache entry is missing, then the file is read.
> How much information needs to be cached for each mailbox?
> For exact compatibility with the current code, it would be necessary to store the counts
> for each day, but if this results in too much storage, then it would be possible to store
> just the weekly counts. This would not affect the historic weekly stats.
> However, the running quarterly stats currently allocate the email to the quarterly buckets
> on a daily rather than weekly basis, so some precision would be lost if only the weekly merged
> counts were available for past months.
> The cache itself would need managing to ensure that the oldest entries were dropped;
> otherwise it would grow very large.
> Note: since contributions to the weekly buckets may come from more than one month, it's
> likely not feasible to use the existing data. This is because the current month is processed
> multiple times, so its data needs to be replaced each time. If its first week overlaps with
> the last week of the previous month, that would result in lost data. This problem might even
> affect daily accumulations; it depends on exactly when the mailboxes are flipped. Having a
> separate cache entry for each monthly mailbox would also make it easier to manage the cache.
> The downside is that it would require more storage, but the cost of re-reading the historic
> mailboxes every day is relatively large.
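
The point in the note above about weeks that span two months can be shown with a small
sketch. It assumes weekly buckets that start on Monday and uses hypothetical mailbox names
and helper functions; the real scripts may define their buckets differently.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day` (assumed bucket boundary)."""
    return day - timedelta(days=day.weekday())


def weekly_counts(days):
    """Count one mailbox's emails per week, keyed by ISO date of the week start."""
    return dict(Counter(week_start(d).isoformat() for d in days))


def merge_weekly(per_mailbox):
    """Sum the weekly counts across all per-mailbox cache entries."""
    total = Counter()
    for weekly in per_mailbox.values():
        total.update(weekly)
    return dict(total)


# The week starting 2015-10-26 receives email from both monthly mailboxes.
october = weekly_counts([date(2015, 10, 30), date(2015, 10, 31)])
november = weekly_counts([date(2015, 11, 1), date(2015, 11, 3)])
merged = merge_weekly({"201510": october, "201511": november})
# merged == {"2015-10-26": 3, "2015-11-02": 1}
```

Because each mailbox keeps its own entry, replacing the current month's counts on every run
leaves the previous month's share of the overlapping week intact.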



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
