xml-general mailing list archives

From Andy Clark <an...@apache.org>
Subject Re: [ANNOUNCE] Xerces HTML Parser
Date Fri, 15 Feb 2002 01:03:25 GMT
Joseph Kesselman/CAM/Lotus wrote:
> One question: A huge percentage of the files out there which claim to be
> HTML aren't, or at least aren't correct HTML. Browsers are generally very
> forgiving and attempt to read past those errors.... but exactly how they
> recover varies from browser to browser, so consistency of response to those
> documents is a problem. Does this HTML prototype attempt that kind of
> recovery? Should it? And if it should, does it document what approach it's
> using so it can be compared with the various browsers and/or W3C's "tidy"
> tool?

I wrote the NekoHTML parser as an example that uses the
Xerces Native Interface (XNI) to show how easily other
types of parsers can be written using the framework. But being
able to parse HTML files is quite useful beyond just being an
example of XNI.

One of my goals was to make the parser operate in a serial 
manner. Because NekoHTML doesn't buffer the document content, 
it cannot clean up the document as much as a tool like Tidy. 
However, it uses much less memory than Tidy or other equivalents 
while being able to fix up most of the common problems.

Another benefit of writing the parser using the XNI framework
is that the codebase can remain incredibly small. The parser can
generate XNI events and work with all of the existing (and future)
XNI tools. For example, I don't have to write any code to create
DOM, JDOM, or DOM4J trees; emit SAX events; or serialize the
document to a file. I just plug it in and it works.
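
The same plug-in idea can be illustrated with the JDK's own SAX and
JAXP classes. This is not XNI code and not taken from NekoHTML; it is
a minimal sketch in which a hypothetical emitEvents method stands in
for the parser, and the class and method names are made up for the
example. The event source is written once, and the existing tools
turn its events into a DOM tree or serialized text.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

/** Illustration only: one event source, two ready-made consumers. */
public class PluggableSinksSketch {

    /** Stands in for a parser: emits the events for <p>hello</p>. */
    static void emitEvents(ContentHandler sink) {
        try {
            char[] text = "hello".toCharArray();
            sink.startDocument();
            sink.startElement("", "p", "p", new AttributesImpl());
            sink.characters(text, 0, text.length);
            sink.endElement("", "p", "p");
            sink.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Creates an identity handler that forwards SAX events to a Result. */
    static TransformerHandler newIdentityHandler() {
        try {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            return factory.newTransformerHandler();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a DOM tree from the events without any tree-building code. */
    static Node toDom() {
        TransformerHandler handler = newIdentityHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);
        emitEvents(handler);
        return result.getNode();
    }

    /** Serializes the same events to text, again with no custom code. */
    static String toText() {
        TransformerHandler handler = newIdentityHandler();
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));
        emitEvents(handler);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toDom().getFirstChild().getNodeName());
        System.out.println(toText());
    }
}
```

Only emitEvents knows anything about the document; swapping the
Result passed to the handler changes what gets produced.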

But back to your question...

I don't claim to clean documents a certain way; the goal is
just to produce a balanced, well-formed document. This work,
though, is done by the tag balancer -- the scanner just 
tokenizes the input. By separating the tag balancing code
into an XNI component in the document pipeline, I could
certainly write different kinds of balancers that attempt
to clean up the events in their own way. But I don't try
to do it the Microsoft IE way or the Netscape Navigator
way. However, I should document better *how* I do my
particular brand of tag balancing.
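
To make the scanner/balancer split concrete, here is a minimal
sketch of a balancer as a separate stage in an event pipeline. It
does not use the XNI interfaces and does not reproduce NekoHTML's
actual balancing rules: the Handler interface, the class names, and
the two rules applied (close any elements left open when an end tag
arrives, and drop end tags that were never opened) are all
assumptions made for illustration. It processes events serially and
keeps only the stack of open element names, not the document.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch; not NekoHTML's actual balancing rules. */
public class TagBalancerSketch {

    /** Receives already-tokenized events, in document order. */
    interface Handler {
        void startElement(String name);
        void endElement(String name);
        void characters(String text);
    }

    /** Pipeline stage: forwards events, adding end tags where needed. */
    static class Balancer implements Handler {
        private final Handler next;
        private final Deque<String> open = new ArrayDeque<>();

        Balancer(Handler next) { this.next = next; }

        public void startElement(String name) {
            open.push(name);
            next.startElement(name);
        }

        public void endElement(String name) {
            if (!open.contains(name)) {
                return; // stray end tag with no matching start: drop it
            }
            // Close everything opened after the matching start tag.
            while (!open.isEmpty()) {
                String top = open.pop();
                next.endElement(top);
                if (top.equals(name)) {
                    break;
                }
            }
        }

        public void characters(String text) { next.characters(text); }

        /** Call at end of input to close whatever is still open. */
        void endDocument() {
            while (!open.isEmpty()) {
                next.endElement(open.pop());
            }
        }
    }

    /** Final stage: serializes the events it receives to a string. */
    static class Serializer implements Handler {
        final StringBuilder out = new StringBuilder();
        public void startElement(String name) { out.append('<').append(name).append('>'); }
        public void endElement(String name) { out.append("</").append(name).append('>'); }
        public void characters(String text) { out.append(text); }
    }

    /** Plays the scanner's role for the input <b><i>text</b></u>more. */
    static String demo() {
        Serializer serializer = new Serializer();
        Balancer balancer = new Balancer(serializer);
        balancer.startElement("b");
        balancer.startElement("i");
        balancer.characters("text");
        balancer.endElement("b");   // <i> is still open at this point
        balancer.endElement("u");   // <u> was never opened
        balancer.characters("more");
        balancer.endDocument();
        return serializer.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <b><i>text</i></b>more
    }
}
```

Because the balancer sits between the source and the serializer, a
different balancing policy could be swapped in without touching
either of them.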

The parser will not be able to handle incredibly bad HTML
documents. But I hope it hits the sweet spot of existing
documents. I've run it on a number of major websites that
have their own sets of problems (CNN, Slashdot, etc.) and
it handles them pretty well.

So I would like people to try it out and let me know 
whether it's worth integrating into Xerces-J.

-- 
Andy Clark * andyc@apache.org
