lucene-java-user mailing list archives

From Terence Parr <pa...@jguru.com>
Subject Re: HTML parser
Date Fri, 19 Apr 2002 05:38:28 GMT

On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:

> Hello,
>
> I need to select an HTML parser for the application that I'm writing
> and I'm not sure what to choose.
> The HTML parser included with Lucene looks flimsy, JTidy looks like a
> hack and an overkill, using classes written for Swing
> (javax.swing.text.html.parser) seems wrong, and I haven't tried David
> McNicol's parser (included with Spindle).
>
> Somebody on this list must have done some research on this subject.
> Can anyone share some experiences?
> Have you found a better HTML parser than any of those I listed above?
> If your application deals with HTML, what do you use for parsing it?

Hi Otis,

I have an HTML parser built for ANTLR, but it's pretty strict in what it 
accepts.  Not sure how useful it will be for you, but here it is:

http://www.antlr.org/grammars/HTML

I am not sure what your goal is, but I personally have to scarf all
sorts of HTML from various websites to suck them into the jGuru search
engine.  I use a simple stripHTML() method I wrote to handle it.  Works
great.  Kills everything but the text.  Is that the kind of thing you
are looking for, or do you really want to parse rather than filter?
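
To make "filter, not parse" concrete, here is a minimal sketch of what such a
text-only filter could look like. It is not the jGuru stripHTML() code, which
is not shown in this message; the class name HtmlStripper, the regular
expressions, and the short entity list are all assumptions made for
illustration. It drops script/style blocks, comments, and tags, decodes a
handful of common entities, and collapses whitespace.

```java
import java.util.regex.Pattern;

// Hypothetical helper: a regex-based filter, not a real HTML parser.
final class HtmlStripper {
    // <script>...</script> and <style>...</style>, including their contents.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // HTML comments, which may span several lines.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // Any remaining tag, opening or closing.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private HtmlStripper() {}

    static String stripHTML(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        // Replace tags with a space so adjacent words do not run together.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode only a few common entities; "&amp;" goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

A filter like this is tolerant of sloppy markup precisely because it never
builds a document tree, which also means it cannot tell you anything about
structure (titles, links, headings).  A ">" inside an attribute value will
confuse it, so treat it as a starting point for indexing plain text only.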

Terence
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
