lucene-java-user mailing list archives

From: Michael Giles <mgi...@visionstudio.com>
Subject: Re: HTML Parsing problems...
Date: Sat, 20 Sep 2003 10:59:31 GMT

Erik,

Probably a good idea to swap something else in, although Neko introduces a 
dependency on Xerces.  I didn't play with Neko because I am currently using 
a different XML parser and didn't want to deal with the conflicts (I also 
find dependencies on specific parsers annoying).  However, yesterday I 
downloaded TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is 
great!  It is small and fast and so far has parsed every page I've thrown 
at it.  I wrote a SAX ContentHandler that only grabs the text and does a 
few other little things (like inserting spaces, removing tabs/line feeds, 
grabbing title) and it seems to be a perfect fit for the job.  It requires 
the SAX framework, but is parser independent.  The only tweak I made to the 
TagSoup code was to add an "else" to fix a bug where it was consuming the 
";" after entities that it did not handle.
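
For illustration, a text-only handler like the one described above might look roughly like the sketch below. This is not the code from this message: the class name TextExtractor and its getTitle()/getText() methods are made up for the example, and it assumes the parser reports the title element as "title" in either localName or qName. It depends only on the standard SAX API, so it should plug into any SAX XMLReader.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical sketch of a text-only SAX handler (names are illustrative).
 * Collects body text and the page title, collapsing tabs, line feeds and
 * other whitespace runs into single spaces.
 */
class TextExtractor extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final StringBuilder title = new StringBuilder();
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equalsIgnoreCase(elementName(localName, qName))) {
            inTitle = true;
        }
        // Treat every tag boundary as a word break so adjacent blocks don't run together.
        appendSpace(current());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equalsIgnoreCase(elementName(localName, qName))) {
            inTitle = false;
        }
        appendSpace(current());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        StringBuilder target = current();
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (Character.isWhitespace(c)) {
                appendSpace(target);   // tabs and line feeds become one space
            } else {
                target.append(c);
            }
        }
    }

    /** Title text seen so far, trimmed. */
    String getTitle() {
        return title.toString().trim();
    }

    /** Non-title text seen so far, trimmed. */
    String getText() {
        return text.toString().trim();
    }

    private StringBuilder current() {
        return inTitle ? title : text;
    }

    // Non-namespace-aware parsers leave localName empty and fill in qName only.
    private static String elementName(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    // Append a single space unless the buffer is empty or already ends with one.
    private static void appendSpace(StringBuilder sb) {
        if (sb.length() > 0 && sb.charAt(sb.length() - 1) != ' ') {
            sb.append(' ');
        }
    }
}
```

To use it, set an instance as the content handler on the XMLReader, parse the page, then read getTitle() and getText() to fill the fields being indexed.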

If Neko is potentially headed into the Apache fold, that probably makes 
sense.  But if you are interested in my TagSoup ContentHandler for testing 
it out, just let me know.

-Mike

At 08:08 PM 9/19/2003 -0400, you wrote:
>I'm going to swap in the neko HTML parser for the demo refactorings I'm
>doing.  I would be all for replacing the demo HTML parser with this.
>
>If you look at the Ant <index> task in the sandbox, you'll see that I
>used JTidy for it and it works well, but I've heard that neko is faster
>and better so I'll give it a try.
>
>         Erik
>


