abdera-dev mailing list archives

From "Garrett Rooney" <roo...@electricjellyfish.net>
Subject Re: Parsing HTML
Date Mon, 07 Aug 2006 22:08:58 GMT
On 8/7/06, James M Snell <jasnell@gmail.com> wrote:
> I've put together a fairly simple HTML->Abdera/Axiom impl based on the
> Tagsoup parser [1].  It implements the Abdera Parser interface and
> creates a Document<Element> model that represents HTML as well-formed
> XHTML content.  Further, it supports the ParseFilter mechanism so we
> can filter out unsafe HTML content (e.g. script tags).
>
> For example:
>
>     Parser parser = new HtmlParser();
>     ParserOptions options = parser.getDefaultParserOptions();
>     options.setParseFilter(new SafeContentWhiteListParseFilter());
>
>     String h = "foo<p style='background-color:blue'>This "
>         + "<script>alert('foo');</script> <a href='this is foo'>is</a> foo "
>         + "<b>bar</b> &nbsp;&raquo;&lt;foo&gt; hello";
>
>     ByteArrayInputStream in = new ByteArrayInputStream(h.getBytes());
>
>     Document<Element> doc = parser.parse(in, (URI)null, options);
>
>     doc.getRoot().writeTo(System.out);
>
> // Outputs
> <xhtml:div xmlns:xhtml="http://www.w3.org/1999/xhtml">foo<xhtml:p>This
> alert('foo'); <xhtml:a href="this is foo" shape="rect">is</xhtml:a> foo
> <xhtml:b>bar</xhtml:b>  »&lt;foo&gt; hello</xhtml:p></xhtml:div>
>
> There are still little bits of weirdness, but for the most part it seems
> to work really well.  On the downside, I'm not sure if the Tagsoup
> license is compatible with the Apache license, otherwise I'd check this
> in to the extensions module.
>
> (and oh, btw, so far this has been implemented as a single class with
> only 189 lines of code, most of which are formatting :-) ....)

That's pretty slick.  I'll look into the licensing issue.

-garrett
