lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rob Staveley (Tom)" <rstave...@seseit.com>
Subject RE: HTML text extraction
Date Wed, 21 Jun 2006 07:04:13 GMT
I found that CyberNeko left style and script in the text and JTidy produced
better output, but both of them use DOM and were therefore subject to
OutOfMemory errors (JTidy being worse than CyberNeko). I've since then moved
over to TagSoup, which I needed to customise to strip style script (a simple
tweak), but "kept on trucking" with any size document. 

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodnetic@yahoo.com] 
Sent: 21 June 2006 07:37
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction

John,

I also wrote about using NekoHTML, I think.  I prefer that to JTidy.  That
also tells you what Simpy.com uses.

Otis

----- Original Message ----
From: John Wang <john.wang@gmail.com>
To: java-user@lucene.apache.org
Sent: Wednesday, June 21, 2006 1:39:41 AM
Subject: HTML text extraction

Can someone please suggest a HTML text extraction library? In the Lucene
book, it recommends Tidy. Seems jtidy is not really being maintained.

Otis, what do you guys use at Simpy?

Thanks

-john




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Mime
View raw message