httpd-dev mailing list archives

From Sam Ruby <ru...@apache.org>
Subject Re: atom feeds for projects
Date Tue, 04 Jul 2006 14:49:59 GMT
Garrett Rooney wrote:
> On 7/4/06, Sam Ruby <rubys@apache.org> wrote:
>> Garrett Rooney wrote:
>> > On 7/4/06, Sam Ruby <rubys@apache.org> wrote:
>> >
>> >> > To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
>> >> > Ruby file that only solved part of the problem. Again, AFAIK, no one
>> >> > ever wrote a patch in C for mod_mbox to attempt to resolve this issue.
>> >>
>> >> I offered.  The response was, and I quote, "Erm, no".
>> >
>> > The "Erm, no" was in response to the approach, not the offer to help,
>> > IIRC.
>> >
>> > If you're willing to fix the problem the right way, by adding real
>> > support for character sets to mod_mbox, I'm sure nobody would have a
>> > problem with  that.
>>
>> You chose to snip the portion where I argue that the approach I outlined
>> is necessary, at least as a fall-back/safety net.  Care to explain why
>> such a fall-back/safety net isn't necessary or appropriate?
> 
> No argument that it's necessary, but it seems kind of pointless to fix
> that part without fixing the underlying fact that mod_mbox is totally
> ignorant of character sets.  You'll get perfectly "valid" junk in the
> vast majority of cases, that doesn't seem like a real step forward to
> me.

"vast majority"?  I beg to differ.

In any case, the current code assumes that everything is valid utf-8.
And that assumption has shown no sign of changing since I posted my
offer last October.

For e-mail messages that are either correct UTF-8 or US-ASCII, the 
current code just works.  That's a substantial portion of messages.

With the code I posted, the majority of the messages which are 
iso-8859-1 will be converted to utf-8.  Even if they don't contain the 
proper charset headers.  And if they happen to be "salted" with win-1252 
characters like "smart quotes", those will be corrected too.
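
To illustrate the kind of fall-back described above, here is a minimal
sketch in Python. It is not the Ruby code that was posted; the function
name to_utf8 and the exact order of the checks are assumptions made for
illustration. Input that is already valid utf-8 (which includes
US-ASCII) passes through untouched. Anything else is read as win-1252,
which agrees with iso-8859-1 for bytes 0xA0-0xFF and additionally maps
the "smart quote" bytes. The few bytes win-1252 leaves undefined fall
back to plain iso-8859-1.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as utf-8, guessing the source charset.

    Hypothetical helper for illustration only.
    """
    # Valid utf-8 (and therefore US-ASCII) is returned unchanged.
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        pass

    # Otherwise assume win-1252, which agrees with iso-8859-1 for
    # 0xA0-0xFF and also covers "smart quotes" in 0x80-0x9F.
    try:
        text = raw.decode("cp1252")
    except UnicodeDecodeError:
        # A handful of bytes are undefined in win-1252; iso-8859-1
        # maps every byte, so this decode cannot fail.
        text = raw.decode("latin-1")
    return text.encode("utf-8")
```

The output is always valid utf-8, though not necessarily what the
sender meant if the message was written in some other charset.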

I wager that this covers a substantial portion of the non-spam messages 
in your in-box.

And the impact is dramatic.  IE7 won't display any feed that is not well
formed.  Firefox 2 will stop at the first error.  Bloglines (I'm told,
but haven't verified) will fall back to a rather sub-optimal RSS parser
to handle broken Atom feeds, and the results aren't pretty.  Suffice it
to say (and I say this primarily for Paul's benefit) that I believe
either this code, or code that performs a similar function, will
provide an immediate improvement to Bloglines users who subscribe to
mod_mbox produced feeds.

As to handling the charset correctly, this can proceed incrementally.
Parsing the header isn't all that hard.  Fixing the body given the
charset should be only one call.  Expanding this to the subject and from
headers (presuming that they, too, are covered by the charset, I haven't
checked what the specs and/or common practice indicate in this matter)
can be done at leisure.
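
The incremental steps above can be sketched with Python's standard
email package. This is only an illustration, not mod_mbox code: the
function names are made up, the message is assumed to be a single-part
text message, and defaulting to utf-8 when no charset is declared is an
assumption that mirrors what the current code is said to do. On the
open question about headers: RFC 2047 has Subject and From carry their
own charset in "encoded-words", independent of the body's Content-Type,
so the second function decodes those separately.

```python
import email
from email.header import decode_header, make_header


def body_as_utf8(raw_message: bytes) -> bytes:
    """Decode a single-part message body using its declared charset."""
    msg = email.message_from_bytes(raw_message)
    # Charset parameter of the Content-Type header, if any.
    charset = msg.get_content_charset() or "utf-8"  # assumed default
    # decode=True undoes base64 / quoted-printable transfer encoding.
    payload = msg.get_payload(decode=True) or b""
    try:
        text = payload.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset label: fall back to a decode that
        # cannot fail, so the result is at least valid utf-8.
        text = payload.decode("latin-1")
    return text.encode("utf-8")


def header_as_text(raw_message: bytes, name: str) -> str:
    """Decode RFC 2047 encoded-words in a header such as Subject."""
    msg = email.message_from_bytes(raw_message)
    value = msg[name] or ""
    return str(make_header(decode_header(value)))
```

For a message labelled "charset=iso-8859-1", body_as_utf8 turns the
byte 0xE9 into the two-byte utf-8 sequence for "é", and header_as_text
turns "=?iso-8859-1?q?caf=E9?=" into "café".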

I'm willing to help there too.  But I have seen too many emails and too 
many tools that are broken when it comes to encoding to want to invest 
the time in learning how to build and deploy a test version of mod_mbox 
as long as the prevailing mood of the project can be summed up with 
"Erm, no".

- Sam Ruby
