httpd-dev mailing list archives

From Sam Ruby <ru...@apache.org>
Subject Re: atom feeds for projects
Date Tue, 04 Jul 2006 13:31:10 GMT
Paul Querna wrote:
> robert burrell donkin wrote:
>> On 7/2/06, Sam Ruby <rubys@apache.org> wrote:
>>> robert burrell donkin wrote:
>>>> the mailing list archives at apache run on mod_mbox which also supplies
>>>> atom
>>>> feeds for these lists. i've added the feed from general to the front
>>> page
>>>> and think it'd be cool to add feeds to the pages in projects as well.
>>> since
>>>> the focus of  podlings should be recruiting developers (not users) i'm
>>>> thinking of adding feeds to the dev lists.
>>>>
>>>> opinions?
>>>>
>>>> volunteers?
>>> Just be aware that the feeds produced are rarely well formed XML, mostly
>>> due to encoding issues.  For example: http://tinyurl.com/h5f7t
>>>
>>> I tried to submit a patch based on my limited understanding of the code,
>>> and was told that my patch wasn't acceptable
> 
> To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
> Ruby file that only solved part of the problem. Again, AFAIK, no one
> ever wrote a patch in C for mod_mbox to attempt to resolve this issue.

I offered.  The response was, and I quote, "Erm, no".

>>> and that XML parsers that
>>> require well-formedness were broken anyway -- despite that being
>>> explicitly what the spec requires.
> 
> Its unfortunate that the discussion degraded into that.
> 
>>> I'd be willing to try again, but only if there was active interest in
>>> actually fixing the problem.
> 
> Yes, there is active interest in making mod_box better.

This thread was in October, and since then the feed has not improved.

>> IMO we should fix the feed but i'm not involved with mod_mbox (or httpd).
>> anyone who is want to jump in here?
> 
> The primary bug is lack of encoding support.  mod-mbox just doesn't even
> try to do it.
> 
> Someone needs to write something that touches many parts of the code,
> using the apr_xlate API to convert the content to utf-8.  (This would
> also help it validate as HTML).  Once that is done, we do need to worry
> about out of range characters, some of which would be removed, others
> possibly HTML encoded.

Inside the message, there may be a content-type header.  Inside this 
header, there may be a charset parameter.  This charset parameter may be 
quoted, or it may not.  It may be correct, or it may not.
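
A minimal sketch of pulling that parameter out, handling both the quoted
and the unquoted form.  The function name extract_charset is hypothetical
and is not taken from mod_mbox.  The sketch does not handle semicolons
inside other quoted parameters, and it makes no attempt to decide whether
the declared charset is actually correct.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper: copy the charset parameter of a Content-Type value
 * into out (at most outlen - 1 bytes).  Returns 1 if a non-empty charset
 * was found, 0 otherwise. */
int extract_charset(const char *content_type, char *out, size_t outlen)
{
    const char *p = content_type;
    size_t n = 0;

    if (content_type == NULL || out == NULL || outlen == 0)
        return 0;

    /* Parameters follow semicolons; look at each one in turn. */
    while ((p = strchr(p, ';')) != NULL) {
        p++;
        while (isspace((unsigned char)*p)) p++;
        if (strncasecmp(p, "charset", 7) != 0)
            continue;
        p += 7;
        while (isspace((unsigned char)*p)) p++;
        if (*p != '=')
            continue;                    /* e.g. "charsetfoo=..." */
        p++;
        while (isspace((unsigned char)*p)) p++;

        if (*p == '"') {
            /* Quoted form: take everything up to the closing quote. */
            p++;
            while (*p && *p != '"' && n + 1 < outlen)
                out[n++] = *p++;
        } else {
            /* Bare token: stop at whitespace or the next parameter. */
            while (*p && *p != ';' && !isspace((unsigned char)*p)
                   && n + 1 < outlen)
                out[n++] = *p++;
        }
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}
```

A caller would pass the raw header value and fall back to a default when
the function returns 0.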

It would be worthwhile to attempt to extract this, and to attempt to 
convert at least the body portion of the message to utf-8.
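
A rough sketch of that conversion step.  The quoted text above suggests
apr_xlate; this example uses POSIX iconv instead, purely so that it stands
alone outside of httpd.  The name body_to_utf8 is hypothetical.  If the
declared charset is unknown or the bytes do not match it, the function
gives up and returns NULL, in which case the caller would presumably pass
the original bytes on to the sanitizing step.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert len bytes of body from the named charset
 * to UTF-8.  Returns a malloc'd NUL-terminated string, or NULL on
 * failure.  The caller frees the result. */
char *body_to_utf8(const char *charset, const char *body, size_t len)
{
    iconv_t cd;
    char *out, *inp, *outp;
    size_t inleft, outleft;

    cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return NULL;                 /* unknown or misspelled charset */

    /* A UTF-8 character is at most 4 bytes and every character consumes
     * at least one input byte, so 4x the input length is enough. */
    out = malloc(len * 4 + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    inp = (char *)body;              /* iconv's prototype is not const */
    outp = out;
    inleft = len;
    outleft = len * 4;

    /* Convert, then flush any pending shift state. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1 ||
        iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1) {
        iconv_close(cd);
        free(out);
        return NULL;                 /* bytes did not match the charset */
    }

    iconv_close(cd);
    *outp = '\0';
    return out;
}
```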

But in any case, the results after the conversion need to be sanitized.
The Ruby code that I offered to convert to C does exactly that - takes
something that is allegedly utf-8 and corrects a number of common
errors, and produces something that is guaranteed to be well formed.  Of
course, if you feed in absolute garbage, what you will get back is well
formed line noise.
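
To illustrate the kind of sanitizing meant here, a minimal sketch follows.
It is not the code at the URL below and the names are hypothetical.  It
assumes one particular policy: malformed UTF-8 sequences and characters
that XML 1.0 does not allow are replaced with U+FFFD, and &, < and > are
replaced with entity references.  Other policies (for example, guessing
that stray bytes are windows-1252) are equally possible.

```c
#include <stdlib.h>
#include <string.h>

/* Decode one UTF-8 sequence at s (n bytes available).  On success store
 * the code point in *cp and return the sequence length; return 0 if the
 * sequence is malformed, truncated, or overlong. */
static size_t decode_utf8(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long min;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = s[0] & 0x07; }
    else return 0;                    /* stray continuation or bad lead byte */

    if (len > n) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return (*cp >= min) ? len : 0;
}

/* The Char production from XML 1.0. */
static int is_xml_char(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Returns a malloc'd string that is valid UTF-8 and safe to place in XML
 * element content, or NULL if allocation fails. */
char *clean_for_xml(const char *in, size_t len)
{
    const unsigned char *s = (const unsigned char *)in;
    char *out = malloc(len * 5 + 1);  /* worst case: every byte is '&' */
    char *o = out;
    size_t i = 0;

    if (out == NULL) return NULL;

    while (i < len) {
        unsigned long cp;
        size_t n = decode_utf8(s + i, len - i, &cp);

        if (n == 0 || !is_xml_char(cp)) {
            memcpy(o, "\xEF\xBF\xBD", 3); o += 3;    /* U+FFFD */
            i += n ? n : 1;
        } else if (cp == '&') { memcpy(o, "&amp;", 5); o += 5; i++; }
        else if (cp == '<')   { memcpy(o, "&lt;", 4);  o += 4; i++; }
        else if (cp == '>')   { memcpy(o, "&gt;", 4);  o += 4; i++; }
        else { memcpy(o, s + i, n); o += n; i += n; }
    }
    *o = '\0';
    return out;
}
```

Garbage in still produces garbage out, but the garbage is well formed.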

As promised, here is a C version that does approximately the same thing:

http://intertwingly.net/stories/2006/07/04/clean_utf8_for_xml.c

This may be useful in display_atom_entry, mbox_static_message, and
mbox_xml_message.  It is safer than using <![CDATA[ ]]> as email messages
(such as this one) may contain such strings.
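
The CDATA problem can be shown with a very small check.  The function name
cdata_safe is hypothetical.  A CDATA section ends at the first "]]>", so
any message body containing that sequence would close the section early.

```c
#include <string.h>

/* Returns 1 if text could be wrapped in a CDATA section without ending
 * the section early, 0 if it contains the "]]>" terminator. */
int cdata_safe(const char *text)
{
    return strstr(text, "]]>") == NULL;
}
```

Even when this returns 1, CDATA does nothing about invalid bytes or
disallowed control characters, which is why escaping plus cleanup is the
more robust choice.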

Also note that if the content_type of the original MIME message contains 
the string "html", you might want to adjust the type attribute on the 
atom:content element accordingly.
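
A sketch of that check, applying the rule literally.  The function name
atom_content_type is hypothetical.  Note that with type="html" the markup
still has to be escaped when it is written into the feed.

```c
#include <ctype.h>
#include <stddef.h>

/* Returns the value to use for the type attribute of atom:content:
 * "html" if the MIME content type contains "html" (case-insensitive),
 * "text" otherwise. */
const char *atom_content_type(const char *mime_type)
{
    const char *p;

    if (mime_type == NULL)
        return "text";

    for (p = mime_type; *p; p++) {
        /* && short-circuits at the terminating NUL, so this never reads
         * past the end of the string. */
        if (tolower((unsigned char)p[0]) == 'h' &&
            tolower((unsigned char)p[1]) == 't' &&
            tolower((unsigned char)p[2]) == 'm' &&
            tolower((unsigned char)p[3]) == 'l')
            return "html";
    }
    return "text";
}
```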

But back to the original point: even if nobody puts in the effort to
correctly interpret that message based on the specified charset, the
addition of this code or something similar (1) is necessary anyway, (2)
will make the result no worse than it currently is and has been for
months, and (3) will be a marked improvement in that it will correct a
number of common errors.

Please feel free to treat the code mentioned above as being under the
Apache License, Version 2.0.  If you don't like my indentation
or bracing style, by all means, adjust it to your tastes.  Convert the
malloc to use the appropriate apr call.  Or if you prefer, throw it all
away, and start over.  I don't care, I just want the Atom feeds that are
produced to be clean and valid.

> For future discussion of this please use dev@httpd.

OK

> Thanks,
> 
> -Paul

- Sam Ruby
