httpd-dev mailing list archives

From Nick Kew
Subject Re: suggestion: strip comments from served html pages
Date Sun, 07 Nov 2004 20:22:25 GMT
On Sun, 7 Nov 2004, Cliff Woolley wrote:

> > > Why not introduce an option to remove the
> > > comments from the served file?
> >
> > You have that option from mod_xmlns, mod_proxy_html or mod_publisher,
> > to name but three.  Along with caveats about why it's not necessarily
> > a good idea.
> Ultimately I suspect that the biggest problem (assuming that the "what's a
> comment and what isn't" problem is solvable) is that this kind of parsing
> takes a disproportionately large amount of CPU time with respect to the
> amount of network bandwidth it saves.

Agreed.  The reason those modules offer it is that they're already
parsing the markup, so there's no additional overhead.  mod_include
could offer it for the same reason.

OTOH, the fact that we have - and people use - mod_deflate demonstrates
that there is a demand for byte count reduction.  mod_deflate uses a
great deal more CPU, and achieves a great deal more savings, than any
of the above.
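
To make the comparison concrete, here is a rough sketch that counts bytes for one page three ways: as-is, with comments stripped, and deflated. It uses Python's zlib module as a stand-in for mod_deflate (which is built on zlib); the byte_counts helper and the sample page are made up for illustration, and the naive regex ignores the script/style caveat entirely.

```python
import re
import zlib


def byte_counts(html: str) -> dict:
    """Return byte counts for a page: original, comment-stripped, deflated."""
    raw = html.encode("utf-8")
    # Naive comment removal; good enough for counting bytes, not for serving.
    stripped = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL).encode("utf-8")
    return {
        "original": len(raw),
        "comments_stripped": len(stripped),
        # zlib at its default level, as an approximation of mod_deflate.
        "deflated": len(zlib.compress(raw)),
    }


# Hypothetical sample page: one comment plus repetitive table markup.
sample = (
    "<!-- generated by a hypothetical template -->\n"
    + "<tr><td>row</td></tr>\n" * 200
)
counts = byte_counts(sample)
for name, size in counts.items():
    print(f"{name}: {size} bytes")
```

Real pages will give different ratios; the point is only that the two techniques can be measured side by side on your own content before deciding whether either is worth the CPU.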

> I mean really, how much bandwidth
> from *html* are we talking about?  Compared to images?  If you start
> parsing the html, you lose any ability to do zero-copy and the like, not
> to mention the fact that the CPU has to examine every single byte of the
> input and process and/or copy it.  Yuck.

I implemented that for a Client who is serving slow devices (mobile phones
at 9600 baud and with a lot of latency per connection).  That involved
compressing both HTML and a lot of other content, including images
(the details are content-negotiated).  My opinion was that the gains
from manipulating HTML were not worth the extra hassle over mod_deflate,
but the Client - one of the best-known names in the business - took
the view that it was worthwhile.

BTW, the "what is a comment" problem is easier than it looks, as both
<script> and <style> are declared in HTML as having CDATA content.
That makes it trivial to distinguish them from "inert" comments.
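
As a minimal sketch of that distinction (not how any of the modules named above do it; they use a real markup parser), the following treats each <script> and <style> element as opaque CDATA and removes comments everywhere else. The function name strip_inert_comments is hypothetical, and a regex like this assumes reasonably well-formed markup.

```python
import re

# First alternative: a whole <script> or <style> element, kept verbatim
# because its content is CDATA and "<!--" inside it is not a comment.
# Second alternative: an ordinary ("inert") comment, which is dropped.
_TOKEN = re.compile(
    r"(?P<raw><(?P<tag>script|style)\b[^>]*>.*?</(?P=tag)\s*>)"
    r"|<!--.*?-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_inert_comments(html: str) -> str:
    """Remove comments outside <script>/<style>; leave those elements alone."""
    # group("raw") is None when the comment alternative matched.
    return _TOKEN.sub(lambda m: m.group("raw") or "", html)


page = (
    "<p>a<!-- note --></p>"
    "<script><!--\nvar x = 1;\n//--></script>"
    "<style><!-- p{} --></style>"
)
cleaned = strip_inert_comments(page)
print(cleaned)
```

The old "hide the script from ancient browsers" idiom inside <script> survives untouched, while the comment in the paragraph is removed.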

Nick Kew
