lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon Courtenage <cour...@westminster.ac.uk>
Subject Re: HTML text extraction
Date Wed, 21 Jun 2006 07:14:16 GMT
I also use htmlparser, which is rather good.  I've had to customize it, 
though, to parse strings containing
html source rather than accept urls of resources to fetch etc.  Also it 
crashes on meta tags that don't have
name attributes (something I discovered only a couple of days ago).

Simon

Daniel Noll wrote:
> John Wang wrote:
>> Can someone please suggest a HTML text extraction library? In the Lucene
>> book, it recommends Tidy. Seems jtidy is not really being maintained.
>
> We use this library to do our HTML parsing work:
>
> http://htmlparser.sourceforge.net/
>
> It's fairly resilient to bad code, including things like false 
> assumptions about nesting HTML inside script.  (e.g. 
> document.write("</script>");
>
> Daniel
>


-- 
Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
Email: courtes@wmin.ac.uk   Web: http://users.cscs.wmin.ac.uk/~courtes | http://www.sse.wmin.ac.uk


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message