Message-ID: <3C649458.EAD564CC@apache.org>
Date: Fri, 08 Feb 2002 19:15:36 -0800
From: Andy Clark
To: general@xml.apache.org
CC: xerces-j-dev@xml.apache.org
Subject: [ANNOUNCE] Xerces HTML Parser

For a long time users have asked if Xerces can parse HTML files. But since most HTML documents are not well-formed XML documents, it is generally not possible to use a conforming XML parser to read HTML documents.

However, the Xerces Native Interface (XNI) that is the foundation of the Xerces2 implementation defines a framework that allows different kinds of parsers to be constructed by connecting a pipeline of parser components. Therefore, as long as a component can be written that generates the appropriate XNI "events", then it can be used to emit SAX events, build DOM trees, or anything else that you can think of.

So, as a fun little exercise, I have written a basic HTML parser using XNI. It consists of an HTML scanner component that can scan HTML files and generate XNI events and a tag balancing component. The tag balancer cleans up the events produced by the scanner, balancing mismatched tags and adding tags where necessary. And it does all of this in a streaming manner to minimize the amount of memory required.
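
To make the scanner-plus-balancer idea concrete, here is a minimal sketch in Python (chosen only for brevity). It is not the XNI API and not the NekoHTML code: the names `scan`, `balance`, `serialize`, and `VOID`, and the `(kind, value)` event tuples, are all assumptions made up for illustration. The scanner yields events one at a time, and the balancer is a filter that keeps only a stack of open element names, so nothing beyond that stack is held in memory.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (kind, value), where kind is "start", "end", or "text".
Event = Tuple[str, str]

# Hypothetical, incomplete set of elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}

# Matches either a tag (attributes are skipped, not parsed) or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def scan(html: str) -> Iterator[Event]:
    """Toy scanner: yields start/end/text events in document order."""
    for match in TOKEN.finditer(html):
        closing, name, text = match.groups()
        if text is not None:
            yield ("text", text)
        elif closing:
            yield ("end", name.lower())
        else:
            yield ("start", name.lower())


def balance(events: Iterable[Event]) -> Iterator[Event]:
    """Toy tag balancer: passes events through, fixing up mismatched tags."""
    open_elements: List[str] = []
    for kind, value in events:
        if kind == "start":
            yield (kind, value)
            if value in VOID:
                # Add the end tag that HTML authors never write.
                yield ("end", value)
            else:
                open_elements.append(value)
        elif kind == "end":
            if value not in open_elements:
                continue  # stray end tag with no matching start: drop it
            # Close everything still open inside the element being closed.
            while open_elements:
                top = open_elements.pop()
                yield ("end", top)
                if top == value:
                    break
        else:
            yield (kind, value)
    # End of input: close whatever is still open.
    while open_elements:
        yield ("end", open_elements.pop())


def serialize(events: Iterable[Event]) -> str:
    """Example downstream consumer: writes the balanced events back out."""
    parts = []
    for kind, value in events:
        if kind == "start":
            parts.append("<" + value + ">")
        elif kind == "end":
            parts.append("</" + value + ">")
        else:
            parts.append(value)
    return "".join(parts)


print(serialize(balance(scan("<p>Hello <b>world<br></p></i>"))))
# prints: <p>Hello <b>world<br></br></b></p>
```

The last stage is interchangeable: `serialize` could be swapped for a consumer that builds a tree or forwards the events to some other handler, which is the same pipeline idea described above. A real balancer would also need HTML-specific rules (implied parents, optional end tags, and so on) that this sketch leaves out.
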
Since I wrote the HTML parser as an example of using XNI and because the code is considered alpha quality (but it seems to work quite well, actually!), I am posting the code with a very limited license. Even though it contains the complete source code for the HTML parser, the license only allows the user to experiment but gives no right to actually use the code in a product.

If the source isn't "free" or "open", why release it at all? I want to get an idea of what people think of the code first. Then, if there's enough interest, I would like to either donate the code to the Xerces-J project or make it available elsewhere under a true open source license.

So, if you've been looking for a way to parse HTML documents please try out the HTML parser and let me know what you think. There should be enough information in the documentation to get you started. Check out the "NekoHTML" project listed on my Apache web site:

  http://www.apache.org/~andyc/

Have fun!

--
Andy Clark * andyc@apache.org