From: Andy Clark
Date: Thu, 14 Feb 2002 17:03:25 -0800
To: xerces-j-dev@xml.apache.org
CC: general@xml.apache.org, xerces-j-user@xml.apache.org
Subject: Re: [ANNOUNCE] Xerces HTML Parser

Joseph Kesselman/CAM/Lotus wrote:
> One question: A huge percentage of the files out there which claim to be
> HTML aren't, or at least aren't correct HTML. Browsers are generally very
> forgiving and attempt to read past those errors... but exactly how they
> recover varies from browser to browser, so consistency of response to those
> documents is a problem. Does this HTML prototype attempt that kind of
> recovery? Should it? And if it should, does it document what approach it's
> using so it can be compared with the various browsers and/or W3C's "Tidy"
> tool?

The intention of the NekoHTML parser was to write an example that uses the Xerces Native Interface (XNI) to show how easily other types of parsers can be written with the framework. But being able to parse HTML files is quite useful beyond just serving as an example of XNI.

One of my goals was to make the parser operate serially. Because NekoHTML doesn't buffer the document content, it cannot clean up a document as thoroughly as a tool like Tidy.
However, it uses much less memory than Tidy or comparable tools while still being able to fix up most of the common problems.

Another benefit of writing the parser with the XNI framework is that the codebase can remain incredibly small. The parser can generate XNI events and work with all of the existing (and future) XNI tools. For example, I don't have to write any code to create DOM, JDOM, or dom4j trees; emit SAX events; or serialize the document to a file. I just plug it in and it works.

But back to your question... I don't claim to clean documents in any particular way; the goal is simply to produce a balanced, well-formed document. That work, though, is done by the tag balancer -- the scanner just tokenizes the input. Because the tag-balancing code is a separate XNI component in the document pipeline, I could certainly write different kinds of balancers that clean up the event stream in their own ways. But I don't try to do it the Microsoft IE way or the Netscape Navigator way. However, I should better document *how* I do my particular brand of tag balancing.

The parser will not be able to handle incredibly bad HTML documents, but I hope it hits the sweet spot of existing documents. I've run it on a number of major websites that have their own sets of problems (CNN, Slashdot, etc.) and it handles them pretty well. So I would like people to try it out and let me know whether it's worth integrating into Xerces-J.

-- 
Andy Clark * andyc@apache.org

---------------------------------------------------------------------
In case of troubles, e-mail: webmaster@xml.apache.org
To unsubscribe, e-mail: general-unsubscribe@xml.apache.org
For additional commands, e-mail: general-help@xml.apache.org
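[Editor's note] The scanner/balancer split described above -- a tokenizer emits a serial stream of events, and a separate balancing component repairs nesting without buffering the whole document -- can be illustrated with a small sketch. This is not NekoHTML's actual code (NekoHTML is Java and works on XNI events); it is a hypothetical stack-based balancer in Python over invented ("start"/"end"/"text", value) event tuples, showing one way to produce balanced output from a malformed event stream.

```python
# Minimal sketch of a streaming tag balancer, NOT NekoHTML's implementation.
# Events are invented (kind, value) tuples: ("start", tag), ("end", tag),
# or ("text", data). The balancer re-emits them so every start has a
# matching end and ends nest properly.

def balance(events):
    """Re-emit events so that tags are balanced and properly nested."""
    out, stack = [], []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
            out.append((kind, value))
        elif kind == "end":
            if value in stack:
                # Close any still-open inner elements first, then the match.
                while stack[-1] != value:
                    out.append(("end", stack.pop()))
                stack.pop()
                out.append(("end", value))
            # A stray end tag with no matching start is simply dropped.
        else:
            out.append((kind, value))
    # Close anything left open so the output is well-formed.
    while stack:
        out.append(("end", stack.pop()))
    return out
```

Because the balancer only keeps a stack of open elements rather than the whole document, it preserves the serial, low-memory behavior described above; it also shows why such a component cannot do everything Tidy does, since it never sees the document as a whole.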