abdera-dev mailing list archives

From James M Snell <jasn...@gmail.com>
Subject Parsing HTML
Date Mon, 07 Aug 2006 22:01:12 GMT
I've put together a fairly simple HTML->Abdera/Axiom impl based on the
TagSoup parser [1].  It implements the Abdera Parser interface and
creates a Document<Element> model that represents HTML as well-formed
XHTML content.  Further, it supports the ParseFilter mechanism so we
can filter out unsafe HTML content (e.g. script tags).

For example:

    Parser parser = new HtmlParser();
    ParserOptions options = parser.getDefaultParserOptions();
    options.setParseFilter(new SafeContentWhiteListParseFilter());

    // The literal is split with + so it survives line wrapping and compiles.
    String h = "foo<p style='background-color:blue'>This "
             + "<script>alert('foo');</script> <a href='this is foo'>is</a> foo "
             + "<b>bar</b> &nbsp;&raquo;&lt;foo&gt; hello";

    ByteArrayInputStream in = new ByteArrayInputStream(h.getBytes());

    Document<Element> doc = parser.parse(in, (URI)null, options);

    doc.getRoot().writeTo(System.out);

// Outputs
<xhtml:div xmlns:xhtml="http://www.w3.org/1999/xhtml">foo<xhtml:p>This
alert('foo'); <xhtml:a href="this is foo" shape="rect">is</xhtml:a> foo
<xhtml:b>bar</xhtml:b>  »&lt;foo&gt; hello</xhtml:p></xhtml:div>
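
In that output the script element and the style attribute are gone, while
the a element keeps its href.  A whitelist decision of that kind can be
sketched roughly as below.  This is only an illustration: the names
ElementFilterSketch and WhiteListFilterSketch are hypothetical, the interface
is a local stand-in rather than Abdera's ParseFilter API, and the whitelist
contents are guessed from the sample output, not taken from
SafeContentWhiteListParseFilter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a parse filter; NOT the Abdera ParseFilter interface.
interface ElementFilterSketch {
    boolean acceptable(String element);
    boolean acceptable(String element, String attribute);
}

// Accepts only elements and attributes that appear on an explicit whitelist.
class WhiteListFilterSketch implements ElementFilterSketch {
    // Assumed whitelist, chosen to match the sample output above.
    private final Set<String> elements =
        new HashSet<>(Arrays.asList("div", "p", "a", "b"));
    // Attributes are keyed as "element@attribute".
    private final Set<String> attributes =
        new HashSet<>(Arrays.asList("a@href"));

    @Override
    public boolean acceptable(String element) {
        return elements.contains(element.toLowerCase(Locale.ROOT));
    }

    @Override
    public boolean acceptable(String element, String attribute) {
        String key = element.toLowerCase(Locale.ROOT)
                   + "@" + attribute.toLowerCase(Locale.ROOT);
        // An attribute is only acceptable on an acceptable element.
        return acceptable(element) && attributes.contains(key);
    }

    public static void main(String[] args) {
        WhiteListFilterSketch filter = new WhiteListFilterSketch();
        System.out.println(filter.acceptable("script"));     // false
        System.out.println(filter.acceptable("p", "style")); // false
        System.out.println(filter.acceptable("a", "href"));  // true
    }
}
```

Anything not listed is rejected by default, which is the point of a whitelist
over a blacklist: a tag or attribute nobody thought about is dropped rather
than passed through.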

There are still little bits of weirdness, but for the most part it seems
to work really well.  On the downside, I'm not sure whether the TagSoup
license is compatible with the Apache license; if it is, I'd check this
in to the extensions module.

(and oh, btw, so far this has been implemented as a single class with
only 189 lines of code, most of which are formatting :-) ....)

- James

[1] http://home.ccil.org/~cowan/XML/tagsoup/

