lucene-java-user mailing list archives

From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 20:32:06 GMT

One which we've been using can be found at:
http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

We absolutely need to be able to recover gracefully from malformed
HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
failed this criterion when we started our effort.  The above one is
kind of SAX-y but doesn't fall over at the sight of a real web page
;-)

Ian



