From: "Paulo Gaspar"
To: "Lucene Developers List"
Subject: RE: HTMLParser
Date: Sat, 16 Feb 2002 03:13:41 +0100
In-Reply-To: <20020215204710.10429.qmail@web12702.mail.yahoo.com>

Could the following Xerces-based HTML parser be of interest for your work?

This is just the initial announcement; there have been further developments since.

Have fun,
Paulo Gaspar

> -----Original Message-----
> From: Andy Clark [mailto:andyc@apache.org]
> Sent: Saturday, February 09, 2002 4:16 AM
> To: general@xml.apache.org
> Cc: xerces-j-dev@xml.apache.org
> Subject: [ANNOUNCE] Xerces HTML Parser
>
>
> For a long time users have asked if Xerces can parse HTML files.
> But since most HTML documents are not well-formed XML documents,
> it is generally not possible to use a conforming XML parser to
> read HTML documents.
>
> However, the Xerces Native Interface (XNI) that is the foundation
> of the Xerces2 implementation defines a framework that allows
> different kinds of parsers to be constructed by connecting a
> pipeline of parser components. Therefore, as long as a component
> can be written that generates the appropriate XNI "events", then
> it can be used to emit SAX events, build DOM trees, or anything
> else that you can think of.
>
> So, as a fun little exercise, I have written a basic HTML parser
> using XNI. It consists of an HTML scanner component that can scan
> HTML files and generate XNI events and a tag balancing component.
> The tag balancer cleans up the events produced by the scanner,
> balancing mismatched tags and adding tags where necessary. And
> it does all of this in a streaming manner to minimize the amount
> of memory required.
>
> Since I wrote the HTML parser as an example of using XNI and
> because the code is considered alpha quality (but it seems to
> work quite well, actually!), I am posting the code with a very
> limited license. Even though it contains the complete source
> code for the HTML parser, the license only allows the user to
> experiment but gives no right to actually use the code in a
> product.
>
> If the source isn't "free" or "open", why release it at all?
> I want to get an idea of what people think of the code first.
> Then, if there's enough interest, I would like to either donate
> the code to the Xerces-J project or make it available elsewhere
> under a true open source license.
>
> So, if you've been looking for a way to parse HTML documents
> please try out the HTML parser and let me know what you think.
> There should be enough information in the documentation to get
> you started.
> Check out the "NekoHTML" project listed on my
> Apache web site: http://www.apache.org/~andyc/
>
> Have fun!
>
> --
> Andy Clark * andyc@apache.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
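
For anyone who wants a feel for the scanner-plus-tag-balancer pipeline described in the quoted announcement, here is a minimal sketch in plain Java. It does not use XNI or NekoHTML: the Handler interface, the scan method, and the TagBalancer class are hypothetical names invented for this illustration. The scanner is deliberately naive (no attributes, comments, entities, or void elements such as br), and the balancer only closes mismatched tags and drops stray end tags; it does not add implied tags the way the announcement says the real component does.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PipelineSketch {

    /** Hypothetical event sink, standing in for an XNI-style document handler. */
    interface Handler {
        void startElement(String name);
        void endElement(String name);
        void characters(String text);
    }

    /** Stage 1: a naive scanner that turns markup into events for the next stage. */
    static void scan(String html, Handler next) {
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) { next.characters(html.substring(i)); break; }
            if (lt > i) next.characters(html.substring(i, lt));
            int gt = html.indexOf('>', lt);
            if (gt < 0) { next.characters(html.substring(lt)); break; }
            String tag = html.substring(lt + 1, gt).trim().toLowerCase();
            if (tag.startsWith("/")) next.endElement(tag.substring(1).trim());
            else if (!tag.isEmpty()) next.startElement(tag);
            i = gt + 1;
        }
    }

    /** Stage 2: forwards events, repairing mismatched end tags as they stream by. */
    static class TagBalancer implements Handler {
        private final Handler next;
        private final Deque<String> open = new ArrayDeque<>();

        TagBalancer(Handler next) { this.next = next; }

        public void startElement(String name) {
            open.push(name);
            next.startElement(name);
        }

        public void endElement(String name) {
            if (!open.contains(name)) return;        // stray end tag: drop it
            while (!open.peek().equals(name)) {      // close elements left open inside
                next.endElement(open.pop());
            }
            next.endElement(open.pop());
        }

        public void characters(String text) { next.characters(text); }

        /** Close whatever is still open once the input is exhausted. */
        void endDocument() {
            while (!open.isEmpty()) next.endElement(open.pop());
        }
    }

    /** Wires scanner -> balancer -> a final stage that serializes events to text. */
    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Handler sink = new Handler() {
            public void startElement(String n) { out.append('<').append(n).append('>'); }
            public void endElement(String n) { out.append("</").append(n).append('>'); }
            public void characters(String t) { out.append(t); }
        };
        TagBalancer balancer = new TagBalancer(sink);
        scan(html, balancer);
        balancer.endDocument();
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <p><b>bold<i>both</i></b>text</p>
        System.out.println(balance("<p><b>bold<i>both</b>text</i>"));
    }
}
```

The last stage here just serializes the events back to text; swapping in a different Handler is what would let the same scanner and balancer feed something else, such as a tree builder, without either stage holding the whole document.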