lucene-java-user mailing list archives

From Michael Giles
Subject Re: HTML Parsing problems...
Date Mon, 22 Sep 2003 13:42:09 GMT
Yeah, I was using HTMLParser for a few days until I tried to parse a 400K 
document and it spun at 100% CPU for a very long time.  It is tolerant of 
bad HTML, but does not appear to scale.  TagSoup processed the same 
document in a second or less at <25% CPU.


At 02:42 PM 9/22/2003 +0200, you wrote:

>TagSoup is great - however, it is neither maintained nor developed (the same 
>could be said about JTidy as well, but TagSoup's history is much 
>shorter...). I'm using HTMLParser for 
>my application, and it also works very well, even for ill-formed input. 
>It's also very actively developed.
>Best regards,
>Andrzej Bialecki
>Software Architect, System Integration Specialist
>CEN/ISSS EC Workshop, ECIMF project chair
>EU FP6 E-Commerce Expert/Evaluator
