lucene-dev mailing list archives

From "Erik Hatcher" <li...@ehatchersolutions.com>
Subject Re: HTMLParser
Date Sat, 16 Feb 2002 03:40:23 GMT
I'm wondering how this Xerces-based parser compares to JTidy.  Does anyone know?

How does HTMLParser.jj compare to JTidy's capabilities?

    Erik

----- Original Message -----
From: "Paulo Gaspar" <paulo.gaspar@krankikom.de>
To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
Sent: Friday, February 15, 2002 9:13 PM
Subject: RE: HTMLParser


> Could the following Xerces-based HTML parser be of interest for
> your work?
>
> This is just the initial announcement, but there have been
> further developments since.
>
>
> Have fun,
> Paulo Gaspar
>
> > -----Original Message-----
> > From: Andy Clark [mailto:andyc@apache.org]
> > Sent: Saturday, February 09, 2002 4:16 AM
> > To: general@xml.apache.org
> > Cc: xerces-j-dev@xml.apache.org
> > Subject: [ANNOUNCE] Xerces HTML Parser
> >
> >
> > For a long time users have asked if Xerces can parse HTML files.
> > But since most HTML documents are not well-formed XML documents,
> > it is generally not possible to use a conforming XML parser to
> > read HTML documents.
> >
> > However, the Xerces Native Interface (XNI) that is the foundation
> > of the Xerces2 implementation defines a framework that allows
> > different kinds of parsers to be constructed by connecting a
> > pipeline of parser components. Therefore, as long as a component
> > can be written that generates the appropriate XNI "events", then
> > it can be used to emit SAX events, build DOM trees, or anything
> > else that you can think of.
> >
> > So, as a fun little exercise, I have written a basic HTML parser
> > using XNI. It consists of an HTML scanner component that can scan
> > HTML files and generate XNI events and a tag balancing component.
> > The tag balancer cleans up the events produced by the scanner,
> > balancing mismatched tags and adding tags where necessary. And
> > it does all of this in a streaming manner to minimize the amount
> > of memory required.
> >
> > Since I wrote the HTML parser as an example of using XNI and
> > because the code is considered alpha quality (but it seems to
> > work quite well, actually!), I am posting the code with a very
> > limited license. Even though it contains the complete source
> > code for the HTML parser, the license only allows the user to
> > experiment but gives no right to actually use the code in a
> > product.
> >
> > If the source isn't "free" or "open", why release it at all?
> > I want to get an idea of what people think of the code first.
> > Then, if there's enough interest, I would like to either donate
> > the code to the Xerces-J project or make it available elsewhere
> > under a true open source license.
> >
> > So, if you've been looking for a way to parse HTML documents
> > please try out the HTML parser and let me know what you think.
> > There should be enough information in the documentation to get
> > you started. Check out the "NekoHTML" project listed on my
> > Apache web site: http://www.apache.org/~andyc/
> >
> > Have fun!
> >
> > --
> > Andy Clark * andyc@apache.org
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> > For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
> >
>
> --
> To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>
>
>
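
For anyone who wants a feel for the tag balancing step described in the
announcement above, here is a minimal sketch in Java. It is not NekoHTML's
code and does not use XNI: the class name TagBalancerSketch, the balance
method, and the flat token format ("<b>", "</b>", or plain text, with no
attributes) are all assumptions made up for illustration. It closes
mismatched tags in a single pass and keeps only the stack of open elements
in memory, which is the streaming idea the announcement mentions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of a streaming tag balancer; not NekoHTML's actual code. */
class TagBalancerSketch {

    /**
     * Balances a flat stream of tokens. Assumed token format: "<name>" is a
     * start tag, "</name>" is an end tag, anything else is character data.
     */
    static List<String> balance(List<String> tokens) {
        Deque<String> open = new ArrayDeque<>();   // names of currently open elements
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1);
                if (!open.contains(name)) {
                    continue;                      // stray end tag: drop it
                }
                // Close every element opened after the matching start tag.
                while (!open.peek().equals(name)) {
                    out.add("</" + open.pop() + ">");
                }
                open.pop();
                out.add(token);
            } else if (token.startsWith("<")) {
                open.push(token.substring(1, token.length() - 1));
                out.add(token);
            } else {
                out.add(token);                    // character data passes through
            }
        }
        // End of input: close whatever is still open, innermost first.
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
        return out;
    }
}
```

Under these assumptions, the tokens <p>, <b>, bold, </p> come out as
<p>, <b>, bold, </b>, </p>, and an end tag with no matching start tag is
simply dropped. A real balancer would also need HTML-specific rules
(optional end tags, implied elements), which this sketch leaves out.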
