httpd-modules-dev mailing list archives

From Joshua Marantz <jmara...@google.com>
Subject Re: how to parse html content in handler
Date Fri, 25 Mar 2011 14:19:43 GMT
mod_pagespeed's event-driven HTML parser is open source, and is written in
C++:
http://code.google.com/p/modpagespeed/source/browse/trunk/src/net/instaweb/htmlparse/public/html_parse.h

This parser is tested using HTML from a large number of web sites.  The build
process for this module (http://code.google.com/p/modpagespeed/wiki/HowToBuild)
generates a separate .a for the HTML parser, although it has a few
dependencies that would need to be linked in.  These are all included in
mod_pagespeed.so, which is self-contained but larger.
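
To make "event-driven" concrete, here is a minimal sketch in C of the
callback style such a parser exposes. It is not mod_pagespeed's API or
libxml2's: tag_event_fn, scan_tags, and count_start_tags are hypothetical
names made up for illustration, and the scanner deliberately ignores
comments, script bodies, and '>' inside quoted attribute values.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: called once per tag found in the input.
 * name/len give the tag name (not NUL-terminated); is_end is 1 for </x>. */
typedef void (*tag_event_fn)(const char *name, size_t len, int is_end,
                             void *ctx);

/* Walk the buffer and fire on_tag for every start or end tag.
 * Toy scanner only: no comment, script, or quoted-attribute handling. */
static void scan_tags(const char *html, tag_event_fn on_tag, void *ctx)
{
    assert(html != NULL && on_tag != NULL);

    const char *p = html;
    while ((p = strchr(p, '<')) != NULL) {
        const char *q = p + 1;
        int is_end = 0;

        if (*q == '/') {
            is_end = 1;
            q++;
        }

        const char *name = q;
        while (isalnum((unsigned char)*q))
            q++;

        if (q == name) {
            /* Stray '<' (or "<!"), not a tag name: keep scanning. */
            p = q;
            continue;
        }

        on_tag(name, (size_t)(q - name), is_end, ctx);

        /* Skip the rest of the tag, attributes included. */
        const char *close = strchr(q, '>');
        if (close == NULL)
            break;
        p = close + 1;
    }
}

/* Example consumer: count start tags with a given name, ignoring case. */
struct tag_counter {
    const char *want;
    size_t count;
};

static int name_equals(const char *name, size_t len, const char *want)
{
    if (strlen(want) != len)
        return 0;
    for (size_t i = 0; i < len; i++) {
        if (tolower((unsigned char)name[i]) !=
            tolower((unsigned char)want[i]))
            return 0;
    }
    return 1;
}

static void count_tag(const char *name, size_t len, int is_end, void *ctx)
{
    struct tag_counter *c = ctx;
    if (!is_end && name_equals(name, len, c->want))
        c->count++;
}

static size_t count_start_tags(const char *html, const char *want)
{
    struct tag_counter c = { want, 0 };
    scan_tags(html, count_tag, &c);
    return c.count;
}
```

A handler could call count_start_tags(body, "a") on a NUL-terminated copy
of the content it wants to inspect. For real traffic, a tested parser is
the safer choice; this only shows the shape of the callbacks.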

If there were much interest, we could try to package up a self-contained
library that would make it easier to call from other modules.

See also libxml2, which has an HTML mode.

-Josh

On Fri, Mar 25, 2011 at 9:28 AM, MK <mk@cognitivedissonance.ca> wrote:

> On Thu, 24 Mar 2011 20:10:46 +0800 (CST)
> Whut Jia <whut_jia@163.com> wrote:
> > Hi all,
> > I want to parse some HTML content and extract some elements in my own
> > Apache handler. Could you tell me how to do it? Thanks,
> > Jia
>
> I think right now the only public C library for parsing HTML is in the
> venerable and long-unmaintained libwww.
>
> However, I wrote a quick and simple event-driven parser library a few
> months ago. I have been meaning to open source it on CCAN or somewhere
> but have not gotten around to it, so if you are interested you can send
> me a message directly; I have some basic scraper demos, etc. It is not
> on the scale of libwww (it is just a low-level HTML parser), but I am
> sure it could do what you want, and you can either compile it in or
> link to it from an Apache module (it has no further dependencies).
>
>
> --
> "Enthusiasm is not the enemy of the intellect." (said of Irving Howe)
> "The angel of history[...]is turned toward the past." (Walter Benjamin)
>
>
