lucene-java-user mailing list archives

From Earl Hood <e...@earlhood.com>
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Tue, 15 Mar 2011 04:53:24 GMT
On Mon, Mar 14, 2011 at 11:46 PM, shrinath.m <shrinath.m@webyog.com> wrote:
> I used Jericho and found it extremely simple to start with ...
>
> Just wanted to clarify one thing though.
> Is there some tool that extracts text from HTML without creating the DOM?

Looks like Jericho does what you want already:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html
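
For reference, a minimal sketch of how TextExtractor is typically used,
assuming the Jericho jar is on the classpath. The class name ExtractText
and the helper extractText are made up for illustration; Source,
getTextExtractor(), setIncludeAttributes() and toString() are from the
Jericho javadoc. Note that Jericho still parses the markup into its own
lightweight segments internally; it just does not build a W3C DOM tree.

```java
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

public class ExtractText {

    // Hypothetical helper: returns the plain text of an HTML string.
    static String extractText(String html) {
        // Source wraps the raw markup; no W3C DOM tree is built.
        Source source = new Source(html);

        // TextExtractor strips tags, decodes character references,
        // and collapses runs of whitespace.
        TextExtractor extractor = source.getTextExtractor();

        // Leave attribute values such as title/alt out of the output.
        extractor.setIncludeAttributes(false);

        return extractor.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The resulting string can then be added to a Lucene Document as the
content of an analyzed text field.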

--ewh
