ponymail-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Gruno <humbed...@apache.org>
Subject Re: [incubator-ponymail-foal] branch master updated: Use raw_msg instead of as_bytes, as the latter has a new archived-at appended sometimes (and we don't want that)
Date Tue, 25 Aug 2020 17:49:06 GMT
On 25/08/2020 19.43, sebb wrote:
> On Tue, 25 Aug 2020 at 17:47, Daniel Gruno <humbedooh@apache.org> wrote:
>>
>> On 25/08/2020 18.37, sebb wrote:
>>> -1
>>>
>>> This will likely change the generated ID for emails that already have
>>> archived-at headers
>>
>> No, it will not change that, as they already will have it. This is
>> fixing an issue with randomness in the archiving.
>>
>> See
>> https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L816
> 
> Yes, that adds an archived-at header if the mail does not already have one.
> Since this is added before the generator is called, it may cause
> generators to produce different output.
> This particularly affects the full generator.

Right, if I have an email without the archived-at header, then reloading 
that will produce a different result every time. That's what I was 
hoping to avoid.

> 
> If the message already has an archived-at header, then the full
> generator output should be stable, provided it is given the same
> input.
> 
> But if the input is switched to raw bytes from the reconstructed
> message, this will most likely change it, unnecessarily changing at
> least some ids.

As I just have discovered, this is in fact happening, sort of, because 
.as_bytes does (unintended) normalization of certain headers :\

I guess we'll have to rely on .as_bytes to never change that behavior 
instead, for the full generator. Unrelated to your -1, but still 
annoying as heck.

I've also noticed that the unit tests are spitting out a toooon of error 
messages that I can't quite understand. unrelated to the test results.

> 
>>
>> What happens is, an archived-at with *the current timestamp* will get
>> added to all messages that do not have it, so it's not really helping at
>> all with anything when you use the .as_bytes, as that data will always
>> be different unless there already is such a header.
>>
>> The raw_msg will deliver a much more reliable result going forward, as
>> it doesn't add "random" data to the mix on each reload (because it is
>> the original raw data without headers appended on the fly).
>>
>> I hope this clears matters up.
>>
>>>
>>> Far better to ensure the archived-at header is not added to the parsed
>>> mail before the generator is used.
>>>
>>> On Tue, 25 Aug 2020 at 17:32, <humbedooh@apache.org> wrote:
>>>>
>>>> This is an automated email from the ASF dual-hosted git repository.
>>>>
>>>> humbedooh pushed a commit to branch master
>>>> in repository https://gitbox.apache.org/repos/asf/incubator-ponymail-foal.git
>>>>
>>>>
>>>> The following commit(s) were added to refs/heads/master by this push:
>>>>        new 6dfb9d8  Use raw_msg instead of as_bytes, as the latter has a
new archived-at appended sometimes (and we don't want that)
>>>> 6dfb9d8 is described below
>>>>
>>>> commit 6dfb9d83b1fa6e0ae83bc61446a27fe555751f8c
>>>> Author: Daniel Gruno <humbedooh@apache.org>
>>>> AuthorDate: Tue Aug 25 18:32:33 2020 +0200
>>>>
>>>>       Use raw_msg instead of as_bytes, as the latter has a new archived-at
appended sometimes (and we don't want that)
>>>> ---
>>>>    tools/plugins/generators.py | 8 ++++----
>>>>    1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/tools/plugins/generators.py b/tools/plugins/generators.py
>>>> index 949a3b6..344f115 100644
>>>> --- a/tools/plugins/generators.py
>>>> +++ b/tools/plugins/generators.py
>>>> @@ -151,7 +151,7 @@ def dkim(_msg, _body, lid, _attachments, raw_msg):
>>>>    # Full generator: uses the entire email (including server-dependent data)
>>>>    # Used by default until August 2020.
>>>>    # See 'dkim' for recommended generation.
>>>> -def full(msg, _body, lid, _attachments, _raw_msg):
>>>> +def full(_msg, _body, lid, _attachments, raw_msg):
>>>>        """
>>>>        Full generator: uses the entire email
>>>>        (including server-dependent data)
>>>> @@ -159,15 +159,15 @@ def full(msg, _body, lid, _attachments, _raw_msg):
>>>>        but different copies of the message are likely to have different headers,
thus ids
>>>>
>>>>        Parameters:
>>>> -    msg - the parsed message
>>>> +    msg - the parsed message (not used)
>>>>        _body - the parsed text content (not used)
>>>>        lid - list id
>>>>        _attachments - list of attachments (not used)
>>>> -    _raw_msg - the original message bytes (not used)
>>>> +    raw_msg - the original message bytes
>>>>
>>>>        Returns: "<hash>@<lid>" where hash is sha224 of message
bytes
>>>>        """
>>>> -    mid = "%s@%s" % (hashlib.sha224(msg.as_bytes()).hexdigest(), lid)
>>>> +    mid = "%s@%s" % (hashlib.sha224(raw_msg).hexdigest(), lid)
>>>>        return mid
>>>>
>>>>
>>>>
>>


Mime
View raw message