lucene-java-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: HTML Parsing problems...
Date Mon, 22 Sep 2003 12:42:02 GMT
Michael Giles wrote:
> Erik,
> 
> Probably a good idea to swap something else in, although Neko introduces 
> a dependency on Xerces.  I didn't play with Neko because I am currently 
> using a different XML parser and didn't want to deal with the conflicts 
> (and also find dependencies on specific parsers annoying).  However, 
> yesterday I downloaded 
> TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is great!  
> It is small and fast and so far has parsed every page I've thrown at 
> it.  I wrote a SAX ContentHandler that only grabs the text and does a 
> few other little things (like inserting spaces, removing tabs/line 
> feeds, grabbing title) and it seems to be a perfect fit for the job.  It 
> requires the SAX framework, but is parser independent.  The only tweak I 
> made to the TagSoup code was to add an "else" to deal with a bug where 
> it was consuming ";" after entities that it did not deal with.

TagSoup is great - however, it is neither maintained nor actively 
developed (the same could be said of JTidy, but TagSoup's history is 
much shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) 
for my application, and it also works very well, even on ill-formed 
input. It is also very actively developed.
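
For reference, the text-only SAX ContentHandler described in the quoted 
message could look roughly like the sketch below. This is not Michael's 
code: the class name TextOnlyHandler, its methods, and the choice to keep 
the title separate from the body text are all assumptions made for 
illustration. Since the handler only depends on the SAX interfaces, the 
demo runs it with the JDK's own parser on well-formed input; for real-world 
HTML one would presumably pass in TagSoup's parser as the XMLReader instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects only character data, keeps the <title>
// separately, and collapses tabs/line feeds into single spaces.
class TextOnlyHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    // Namespace-aware readers fill localName, others only qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equalsIgnoreCase(name(localName, qName))) inTitle = true;
        text.append(' '); // keep words in adjacent elements from running together
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equalsIgnoreCase(name(localName, qName))) inTitle = false;
        text.append(' ');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        (inTitle ? title : text).append(ch, start, length);
    }

    // Replace runs of whitespace (tabs, line feeds, spaces) with one space.
    private static String normalize(CharSequence s) {
        return s.toString().replaceAll("\\s+", " ").trim();
    }

    String getTitle() { return normalize(title); }
    String getText()  { return normalize(text); }

    // Works with any XMLReader, so the parser can be swapped freely.
    static TextOnlyHandler extract(XMLReader reader, String markup) throws Exception {
        TextOnlyHandler handler = new TextOnlyHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String page = "<html><head><title>Test\tpage</title></head>"
                    + "<body><p>Hello</p><p>big\nworld</p></body></html>";
        TextOnlyHandler h = extract(reader, page);
        System.out.println(h.getTitle()); // prints "Test page"
        System.out.println(h.getText());  // prints "Hello big world"
    }
}
```

The strings returned by getTitle() and getText() could then be added to a 
Lucene Document as separate fields.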

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



