lucene-java-user mailing list archives

From "Mark Ayad" <m...@javamark.com>
Subject RE: HTML parser
Date Fri, 19 Apr 2002 15:40:09 GMT
You can use the Swing HTML parser to do this, but it is only an HTML 3.2
DTD-based parser.
I have written (attached) a total hack job for breaking up an HTML page into
its component parts; the code gives you an idea ... If anyone wants to know
how to use the Swing-based parser, I can add some code.
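
For reference, a minimal sketch of driving the Swing parser could look like
the following. It is not the attached code: the class names SwingHtmlText and
TextCollector and the sample HTML string are made up for illustration. It
relies only on javax.swing.text.html.parser.ParserDelegator and
HTMLEditorKit.ParserCallback from the JDK, and it simply separates the title
text from the remaining text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingHtmlText {

    /** Hypothetical callback: collects title text and all other text separately. */
    static class TextCollector extends HTMLEditorKit.ParserCallback {
        final StringBuilder title = new StringBuilder();
        final StringBuilder body = new StringBuilder();
        private boolean inTitle;

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = true;
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if (tag == HTML.Tag.TITLE) {
                inTitle = false;
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Text arrives in chunks, in document order, as the parser emits them.
            (inTitle ? title : body).append(data);
        }
    }

    /** Parses the HTML from the reader and returns the filled-in collector. */
    static TextCollector parse(Reader in) throws IOException {
        TextCollector collector = new TextCollector();
        // Third argument 'true' tells the parser to ignore charset directives
        // in the document instead of throwing ChangedCharSetException.
        new ParserDelegator().parse(in, collector, true);
        return collector;
    }

    public static void main(String[] args) throws IOException {
        // Sample input, assumed for illustration only.
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Some <b>bold</b> text.</p></body></html>";
        TextCollector c = parse(new StringReader(html));
        System.out.println("title: " + c.title);
        System.out.println("body:  " + c.body);
    }
}
```

The two buffers could then be indexed as separate Lucene fields; how the
fields are named and analyzed is left to the application.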

Mark




-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com]
Sent: 19 April 2002 07:29
To: lucene-user@jakarta.apache.org
Subject: HTML parser


Hello,

I need to select an HTML parser for the application that I'm writing
and I'm not sure what to choose.
The HTML parser included with Lucene looks flimsy, JTidy looks like a
hack and an overkill, using classes written for Swing
(javax.swing.text.html.parser) seems wrong, and I haven't tried David
McNicol's parser (included with Spindle).

Somebody on this list must have done some research on this subject.
Can anyone share some experiences?
Have you found a better HTML parser than any of those I listed above?
If your application deals with HTML, what do you use for parsing it?

Thanks,
Otis




