abdera-dev mailing list archives

From James M Snell <jasn...@gmail.com>
Subject HTML Parser
Date Mon, 14 Jan 2008 20:26:55 GMT
All,

I have some code based on Henri Sivonen's HTML5 parser that adds HTML 
parsing capabilities to the Abdera API.  For instance,

   URL url = new URL("http://www.snellspace.com");
   Abdera abdera = Abdera.getInstance();
   Parser parser = abdera.getParserFactory().getParser("html");
   Document doc = parser.parse(url.openStream());
   doc.writeTo(System.out);

The parser repairs broken markup and exposes the result through the 
Abdera Element objects.  The two cases where this is particularly 
useful are:

a) Performing autodiscovery of feeds and AtomPub service documents
b) Converting HTML content to XHTML content and protecting feeds against
    accidental breakage.

For example,

   List<Element> list =
     HtmlHelper.discoverLinks(
       "http://www.snellspace.com/wp",
       "application/atom+xml",
       "alternate");
   for (Element el : list) {
     String href = el.getAttributeValue("href");
     String title = el.getAttributeValue("title");
     String type = el.getAttributeValue("type");
     System.out.println(type + ", " + title + ", " + href);
   }
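
To make the autodiscovery idea concrete without the new jars, here is a 
rough, JDK-only sketch.  It is not the HtmlHelper implementation: the 
class name LinkDiscoverySketch and its regex-based matching are 
assumptions made purely for illustration, and unlike a real HTML parser 
it only understands quoted attribute values and does no markup repair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for feed autodiscovery; NOT the Abdera HtmlHelper.
class LinkDiscoverySketch {

  // Matches a <link ...> start tag; group 1 holds the raw attribute text.
  private static final Pattern LINK_TAG =
      Pattern.compile("<link\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

  // Matches name="value" or name='value' pairs (quoted values only).
  private static final Pattern ATTRIBUTE =
      Pattern.compile("([\\w:-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

  // Returns the attributes of every <link> whose type equals the given media
  // type and whose rel list contains the given relation (case-insensitive).
  static List<Map<String, String>> discoverLinks(String html, String type, String rel) {
    List<Map<String, String>> result = new ArrayList<>();
    Matcher tag = LINK_TAG.matcher(html);
    while (tag.find()) {
      // Collect the tag's attributes, with names folded to lower case.
      Map<String, String> attrs = new LinkedHashMap<>();
      Matcher attr = ATTRIBUTE.matcher(tag.group(1));
      while (attr.find()) {
        String value = attr.group(2) != null ? attr.group(2) : attr.group(3);
        attrs.put(attr.group(1).toLowerCase(Locale.ROOT), value);
      }
      // rel is a whitespace-separated list of relation names.
      String relValue = attrs.getOrDefault("rel", "").toLowerCase(Locale.ROOT).trim();
      boolean relMatches =
          Arrays.asList(relValue.split("\\s+")).contains(rel.toLowerCase(Locale.ROOT));
      if (relMatches && type.equalsIgnoreCase(attrs.get("type"))) {
        result.add(attrs);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    String html = "<html><head>"
        + "<link rel=\"stylesheet\" type=\"text/css\" href=\"/style.css\">"
        + "<link rel=\"alternate\" type=\"application/atom+xml\""
        + " title=\"Example feed\" href=\"/feed.atom\">"
        + "</head><body><p>unclosed paragraph</body></html>";
    for (Map<String, String> link : discoverLinks(html, "application/atom+xml", "alternate")) {
      System.out.println(link.get("type") + ", " + link.get("title") + ", " + link.get("href"));
    }
  }
}
```

A regex scan like this breaks on unquoted attributes, on ">" inside 
attribute values, and on links hidden in comments or scripts, which is 
the kind of real-world markup a repairing parser is meant to handle.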

And another:

   Abdera abdera = Abdera.getInstance();
   Entry entry = abdera.newEntry();
   entry.setContentAsXhtml(HtmlCleaner.parse("<p>test<br>foo"));
   System.out.println(entry);

Which outputs:

   <entry xmlns="http://www.w3.org/2005/Atom">
     <content type="xhtml">
       <div xmlns="http://www.w3.org/1999/xhtml">
         <p>test<br />foo</p>
       </div>
     </content>
   </entry>

Note that HtmlCleaner repairs the HTML fragment: the unclosed <p> is 
closed and <br> becomes <br />, so the content is well-formed XHTML.
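
To show why the raw fragment would break a feed, here is a small 
JDK-only sketch that checks whether a fragment is well-formed once it is 
wrapped in the XHTML div that Atom expects.  The class and method names 
are hypothetical, and this only detects the problem; it does none of the 
repair work that HtmlCleaner does.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Abdera.
class WellFormedCheckSketch {

  // Returns true if the fragment parses as XML inside an XHTML <div> wrapper.
  static boolean isWellFormedXhtml(String fragment) {
    String wrapped =
        "<div xmlns=\"http://www.w3.org/1999/xhtml\">" + fragment + "</div>";
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      // DefaultHandler rethrows fatal errors instead of printing them to stderr.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new InputSource(new StringReader(wrapped)));
      return true;
    } catch (SAXException e) {
      // The parser rejected the markup, so it is not well-formed.
      return false;
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormedXhtml("<p>test<br>foo"));        // false
    System.out.println(isWellFormedXhtml("<p>test<br />foo</p>"));  // true
  }
}
```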

I could commit this, but doing so means adding two new optional 
dependency jars.  I think the function is valuable enough to justify the 
addition, but I wanted to run it past the rest of you first.

- James
