lucene-java-user mailing list archives

From "shrinath.m" <>
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Tue, 15 Mar 2011 04:46:53 GMT
I started trying out all your suggestions one by one, thanks to all who

I used Jericho and found it extremely simple to start with ...

Just wanted to clarify one thing though.
Is there some tool that extracts text from HTML without creating the DOM?
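
To make the question concrete, here is a minimal sketch of what DOM-free extraction could look like. It assumes the callback-based HTML parser bundled with the JDK (javax.swing.text.html.parser.ParserDelegator) is acceptable; that choice, the class name StreamingTextExtractor, and the method name extractText are illustrative assumptions, not something suggested in this thread. The parser pushes text chunks to a callback as it reads, so no document tree is kept in memory.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of an HTML document through
// parser callbacks instead of building a DOM.
public class StreamingTextExtractor {

    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of character data.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate chunks with a space so adjacent elements do not run together.
                text.append(data).append(' ');
            }
        };

        // The third argument tells the parser to ignore charset declarations
        // in the markup, since the Reader has already decoded the input.
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(html)));
    }
}
```

The returned string could then be handed to whatever Lucene field is being indexed. This bundled parser is old and fairly lenient, so it may not behave well on every real-world page.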

