lucene-dev mailing list archives

From "Paulo Gaspar" <paulo.gas...@krankikom.de>
Subject RE: HTMLParser
Date Sat, 16 Feb 2002 02:13:41 GMT
Could the following Xerces-based HTML parser be of interest for 
your work?

This is just the initial announcement; there have been further 
developments since.


Have fun,
Paulo Gaspar

> -----Original Message-----
> From: Andy Clark [mailto:andyc@apache.org]
> Sent: Saturday, February 09, 2002 4:16 AM
> To: general@xml.apache.org
> Cc: xerces-j-dev@xml.apache.org
> Subject: [ANNOUNCE] Xerces HTML Parser
> 
> 
> For a long time users have asked if Xerces can parse HTML files. 
> But since most HTML documents are not well-formed XML documents, 
> it is generally not possible to use a conforming XML parser to 
> read HTML documents. 
> 
> However, the Xerces Native Interface (XNI) that is the foundation 
> of the Xerces2 implementation defines a framework that allows 
> different kinds of parsers to be constructed by connecting a
> pipeline of parser components. Therefore, as long as a component 
> can be written that generates the appropriate XNI "events", then
> it can be used to emit SAX events, build DOM trees, or anything
> else that you can think of.
> 
> So, as a fun little exercise, I have written a basic HTML parser 
> using XNI. It consists of an HTML scanner component that can scan
> HTML files and generate XNI events and a tag balancing component.
> The tag balancer cleans up the events produced by the scanner,
> balancing mismatched tags and adding tags where necessary. And
> it does all of this in a streaming manner to minimize the amount
> of memory required.
> 
> Since I wrote the HTML parser as an example of using XNI and
> because the code is considered alpha quality (but it seems to
> work quite well, actually!), I am posting the code with a very
> limited license. Even though it contains the complete source
> code for the HTML parser, the license only allows the user to
> experiment but gives no right to actually use the code in a 
> product.
> 
> If the source isn't "free" or "open", why release it at all?
> I want to get an idea of what people think of the code first.
> Then, if there's enough interest, I would like to either donate
> the code to the Xerces-J project or make it available elsewhere
> under a true open source license.
> 
> So, if you've been looking for a way to parse HTML documents,
> please try out the HTML parser and let me know what you think. 
> There should be enough information in the documentation to get 
> you started. Check out the "NekoHTML" project listed on my
> Apache web site: http://www.apache.org/~andyc/
> 
> Have fun!
> 
> -- 
> Andy Clark * andyc@apache.org
> 

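To make the tag-balancing idea in the announcement concrete, here is a
minimal sketch of a streaming filter that sits between a scanner and a
downstream consumer. It is not NekoHTML's code and does not use the XNI
interfaces: the names TagBalancerSketch, Handler, Balancer and Recorder
are hypothetical, and the repair rules (close anything left open, drop
stray end tags) are an assumption chosen for illustration. The only
state it keeps is the stack of currently open elements, which is what
lets it work without building the whole document in memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class TagBalancerSketch {

    /** Hypothetical stand-in for a downstream event consumer (not an XNI type). */
    interface Handler {
        void startElement(String name);
        void endElement(String name);
        void characters(String text);
    }

    /** Streaming filter: forwards events and repairs mismatched end tags. */
    static final class Balancer implements Handler {
        private final Handler next;
        // Only the currently open elements are remembered.
        private final Deque<String> open = new ArrayDeque<>();

        Balancer(Handler next) { this.next = next; }

        @Override public void startElement(String name) {
            open.push(name);
            next.startElement(name);
        }

        @Override public void endElement(String name) {
            if (!open.contains(name)) {
                return; // stray end tag with no matching start: drop it
            }
            // Close every element opened after the matching start tag,
            // then the matching element itself.
            while (!open.isEmpty()) {
                String top = open.pop();
                next.endElement(top);
                if (top.equals(name)) {
                    break;
                }
            }
        }

        @Override public void characters(String text) {
            next.characters(text);
        }

        /** Called once at end of input: close whatever is still open. */
        void endDocument() {
            while (!open.isEmpty()) {
                next.endElement(open.pop());
            }
        }
    }

    /** Records the balanced event stream as markup so it can be inspected. */
    static final class Recorder implements Handler {
        final StringBuilder out = new StringBuilder();
        @Override public void startElement(String name) { out.append('<').append(name).append('>'); }
        @Override public void endElement(String name) { out.append("</").append(name).append('>'); }
        @Override public void characters(String text) { out.append(text); }
    }

    /** Feeds a badly nested event sequence through the balancer. */
    static String demo() {
        Recorder recorder = new Recorder();
        Balancer balancer = new Balancer(recorder);
        balancer.startElement("p");
        balancer.startElement("b");
        balancer.characters("bold");
        balancer.endElement("p");   // </b> is missing, so it gets added
        balancer.endElement("i");   // never opened, so it gets dropped
        balancer.startElement("ul");
        balancer.startElement("li");
        balancer.characters("item");
        balancer.endDocument();     // closes <li> and <ul>
        return recorder.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
        // prints <p><b>bold</b></p><ul><li>item</li></ul>
    }
}
```

A real HTML balancer would also need element-specific knowledge (for
example, empty elements such as br, or elements that implicitly close
others), which this sketch leaves out.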

