lucene-java-user mailing list archives

From Ian Soboroff <ian.sobor...@nist.gov>
Subject Re: which HTML parser is better?
Date Fri, 04 Feb 2005 15:36:56 GMT

Oops.  It's in the Google cache and also the Internet Archive Wayback
Machine.  I'll drop the original author a note to let him know that
his links are stale.

http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

Ian

"Karl Koch" <TheRanger@gmx.net> writes:

> The link does not work.
>
>> 
>> One which we've been using can be found at:
>> http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
>> 
>> We absolutely need to be able to recover gracefully from malformed
>> HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
>> failed this criterion when we started our effort.  The above one is
>> kind of SAX-y but doesn't fall over at the sight of a real web page
>> ;-)




