lucene-java-user mailing list archives

From Michael Giles
Subject Re: which HTML parser is better?
Date Tue, 01 Feb 2005 13:48:36 GMT
When I tested parsers a year or so ago for intensive use in Furl, the
best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page)
parser by far was TagSoup. It is actively
maintained and improved and I have never had any problems with it.


Jingkang Zhang wrote:

>Three HTML parsers (Lucene web application
>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>Lucene FAQ
>1.3.27. Which is the best? Can it filter tags that are
>auto-created by the MS Word 'Save As HTML files' function?
