spamassassin-dev mailing list archives

From bugzilla-dae...@bugzilla.spamassassin.org
Subject [Bug 5083] spamd REPORT fails if content preview contains wide characters
Date Tue, 05 Sep 2006 15:33:30 GMT
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5083

------- Additional Comments From richard+spamassassin@musicbox.net  2006-09-05 15:33 -------
No, that doesn't work unfortunately.  Leaving $msg_resp flagged as native Perl
Unicode causes the length() function to give the number of Unicode characters,
and hence the Content-Length header sent back to the client doesn't match the
number of UTF-8 bytes in the response body.  spamc whinges, although other
clients might either be using protocol 1.2 or below (no Content-Length) or just
ignore the discrepancy and carry on anyway.
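To illustrate the mismatch, here is a minimal sketch. The response string is made up for the example and is not taken from spamd; it only shows that length() on a character string and the length of its UTF-8 encoding differ as soon as a wide character is present.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical response body containing one wide character (U+263A).
my $msg_resp = "Content preview: \x{263A}\n";

# length() on a character string counts characters, not bytes.
my $chars = length($msg_resp);

# Encoding first gives the number of bytes that would go on the wire.
my $bytes = length(Encode::encode_utf8($msg_resp));

printf "characters=%d utf8_bytes=%d\n", $chars, $bytes;
```

U+263A takes three bytes in UTF-8, so the byte count is two higher than the character count, which is exactly the kind of discrepancy a client checking Content-Length would notice.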

How about explicitly turning *off* the :utf8 layer on the socket (i.e. stopping
the magic that appears to be happening on Linux but not on BSD), and doing the
UTF-8 conversion as in the original patch?  At least that way we won't be
potentially double-encoding anything.

Or we could use "length(Encode::encode_utf8($msg_resp))" or "use bytes;
length($msg_resp)" to try to ensure the Content-Length matches the data
generated by the :utf8 output layer.  However, that feels even hackier than
doing the conversion ourselves once, and then using the same resulting
byte-string both for Content-Length and for verbatim output.
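As a rough sketch of the "convert once, reuse the bytes" idea: build_response() below is a hypothetical helper, not spamd's actual code, and the status line and header spelling are only illustrative. An in-memory filehandle stands in for the client socket, and the :raw layer is assumed to be what "turning off :utf8" would look like.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: encode the body exactly once, then use that same
# byte string for both the Content-length value and the output.
sub build_response {
    my ($status_line, $msg_resp) = @_;
    my $body = Encode::encode_utf8($msg_resp);   # byte string from here on
    return $status_line . "\r\n"
         . "Content-length: " . length($body) . "\r\n\r\n"
         . $body;
}

my $response = build_response("SPAMD/1.1 0 EX_OK", "Preview: \x{263A}\n");

# Stand-in for the socket: an in-memory handle with no encoding layer,
# so the bytes are written verbatim and nothing is encoded a second time.
open(my $out, '>', \my $wire) or die "open: $!";
binmode($out, ':raw');
print {$out} $response;
close($out);
```

Because the header is computed from the already-encoded string, it cannot drift from what is actually written, whatever the platform's default layers happen to be.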

I agree that in general we should let Perl get on with the details of character
conversion, but one other advantage of doing it ourselves in the *particular*
case of spamd, rather than relying on PerlIO layers, is that we can control the
encoding more precisely.  This would be useful if announcement of the encoding
becomes part of the spamc protocol (i.e. Content-Transfer-Encoding).  We can also
be more certain about when the encoding happens, which might be an issue if a
different encoding is needed in each direction: as you pointed out, even with
the current API the best place for a binmode() call is *after* the client
request has been read, since binmode() will apply the translation layer to both
input and output, and it's only output we want to be UTF-8-encoded.
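The ordering point can be sketched with a socketpair standing in for the client connection. Everything here is an assumption for illustration (the request line, the reply text, the use of :encoding(UTF-8)); it is not how spamd is structured, only a demonstration that the layer is pushed after the request has been read.

```perl
use strict;
use warnings;
use Socket;
use IO::Handle;

# A connected pair of sockets: $client plays spamc, $server plays spamd.
socketpair(my $client, my $server, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
    or die "socketpair: $!";
$client->autoflush(1);
$server->autoflush(1);

# Client sends an illustrative request line as plain bytes.
print {$client} "PING SPAMC/1.2\r\n";

# Server reads the request while the handle still has no encoding layer.
my $request = <$server>;

# Only now push the layer; it applies to both directions of $server,
# which is why it must come after the read.
binmode($server, ':encoding(UTF-8)');
print {$server} "caf\x{e9} \x{263A}\n";
close($server);

# Client reads the raw bytes that arrived.
binmode($client, ':raw');
my $reply = do { local $/; <$client> };
close($client);

printf "request=%d bytes, reply=%d bytes\n", length($request), length($reply);
```

Note that with this approach any Content-Length would still have to be computed from the encoded form, which is the argument for doing the conversion by hand instead.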




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
