httpd-modules-dev mailing list archives

From Joshua Marantz <>
Subject Re: how to parse html content in handler
Date Fri, 25 Mar 2011 14:19:43 GMT
mod_pagespeed's event-driven HTML parser is open source, and is written in
C++.  The parser is tested using HTML from large numbers of web sites.  The build
process for this module generates a separate
.a for the HTML parser, although it's got a few dependencies that would need
to be linked in.  These are all included in which is
self-contained but larger.

If there were much interest we could try to package up a
self-contained library that would make it easier to call from other modules.

See also libxml2, which has an HTML mode.
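
To illustrate what "event-driven" means in this thread, here is a minimal
sketch in plain C. It is not mod_pagespeed's or libxml2's API: every name
below (html_events, html_scan, count_tag) is hypothetical, and the scanner
is deliberately naive (no comments, CDATA, script bodies, or '>' inside
quoted attribute values). It only shows the callback shape a handler would
plug into.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event callbacks; any of them may be NULL. */
typedef struct {
    void (*start_tag)(void *ctx, const char *name, size_t len);
    void (*end_tag)(void *ctx, const char *name, size_t len);
    void (*text)(void *ctx, const char *data, size_t len);
} html_events;

/* Walk the buffer once, firing one event per tag or text run.
 * Names are passed as (pointer, length) and are NOT NUL-terminated. */
static void html_scan(const char *html, const html_events *ev, void *ctx)
{
    assert(html != NULL && ev != NULL);
    const char *p = html;
    while (*p) {
        if (*p == '<') {
            const char *close = strchr(p, '>');
            if (!close)
                break;                      /* unterminated tag: stop scanning */
            const char *name = p + 1;
            int is_end = (*name == '/');
            if (is_end)
                name++;
            const char *q = name;
            while (q < close && isalnum((unsigned char)*q))
                q++;
            if (q > name) {                 /* skips <!DOCTYPE ...> and similar */
                size_t len = (size_t)(q - name);
                if (is_end && ev->end_tag)
                    ev->end_tag(ctx, name, len);
                else if (!is_end && ev->start_tag)
                    ev->start_tag(ctx, name, len);
            }
            p = close + 1;
        } else {
            const char *next = strchr(p, '<');
            size_t len = next ? (size_t)(next - p) : strlen(p);
            if (ev->text)
                ev->text(ctx, p, len);
            p += len;
        }
    }
}

/* Example consumer: count start tags with a given name, ignoring case. */
typedef struct { const char *want; int count; } tag_counter;

static void count_start(void *ctx, const char *name, size_t len)
{
    tag_counter *c = ctx;
    if (strlen(c->want) != len)
        return;
    for (size_t i = 0; i < len; i++)
        if (tolower((unsigned char)name[i]) != tolower((unsigned char)c->want[i]))
            return;
    c->count++;
}

static int count_tag(const char *html, const char *want)
{
    tag_counter c = { want, 0 };
    html_events ev = { count_start, NULL, NULL };
    html_scan(html, &ev, &c);
    return c.count;
}
```

A handler would call something like count_tag(body, "a") on the response
body it has buffered. For real-world HTML, a tested parser such as the ones
mentioned above is the safer choice.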


On Fri, Mar 25, 2011 at 9:28 AM, MK <> wrote:

> On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
> Whut  Jia <> wrote:
> > Hi all,
> > I want to parse HTML content and extract some elements in my own
> > Apache handler. How can I do this? Thanks,
> > Jia
> I think right now the only public C library for parsing html is in the
> venerable and long unmaintained libwww.
> However, I wrote a quick and simple, event driven parser library a few
> months ago -- I have been meaning to open source this on CCAN or
> somewhere but have not gotten around to it, so if you are interested
> you can send me a message directly, I have some basic scraper demos
> etc.   It is not on the scale of libwww -- it is just a low level HTML
> parser -- but I am sure it could do what you want, and you can either
> compile it in or link to it with an Apache module (it has no further
> dependencies).
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
