lucene-java-user mailing list archives

From Trejkaz <trej...@trypticon.org>
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Sat, 12 Mar 2011 09:59:58 GMT
On Fri, Mar 11, 2011 at 10:03 PM, shrinath.m <shrinath.m@webyog.com> wrote:
> I am trying to index content within certain HTML tags, how do I index it?
> Which is the best parser/tokenizer available to do this ?

This doesn't really answer the question, but I think it will help...

The features you want to look for:
1. A StAX-like "pull parsing" API - this makes it easier to implement
a java.io.Reader on top of the parser, since Reader is also a pull
API (the caller asks for the next chunk when it wants one).
2. Doesn't try to store the entire HTML file in memory in any form -
this keeps it from bombing on gigantic HTML files, which do occur in
reality.

A specific example of a library that does not meet both of these
criteria is HTMLParser (htmlparser.sf.net), but be cautious of any
library that fails either one.
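
To make the two criteria concrete, here is a minimal sketch of a Reader
built on a pull parser. It is only an illustration under some
assumptions: it uses the JDK's own StAX parser (javax.xml.stream), which
only accepts well-formed XHTML, so for real-world HTML you would swap in
a tolerant parser that offers a similar pull API. The class name
TagTextReader and its constructor arguments are made up for this
example. It emits the text found inside one chosen element and holds
only the current text chunk in memory.

```java
import java.io.IOException;
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: exposes the text inside one element as a Reader. */
class TagTextReader extends Reader {
    private final XMLStreamReader parser;
    private final String tag;     // local name of the element whose text we want
    private int depth = 0;        // greater than 0 while inside the wanted element
    private String pending = "";  // current text chunk pulled from the parser
    private int pos = 0;          // how much of 'pending' was already handed out

    TagTextReader(Reader xhtml, String tag) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: the input needs no DTD processing, so turn it off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        this.parser = factory.createXMLStreamReader(xhtml);
        this.tag = tag;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        try {
            // Pull events only until there is some text to hand out.
            while (pos >= pending.length()) {
                if (!parser.hasNext()) {
                    return -1; // end of document
                }
                int event = parser.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if (parser.getLocalName().equals(tag)) {
                        depth++;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (parser.getLocalName().equals(tag) && --depth == 0) {
                        // Emit a space so adjacent elements do not run together.
                        pending = " ";
                        pos = 0;
                    }
                } else if (depth > 0
                        && (event == XMLStreamConstants.CHARACTERS
                            || event == XMLStreamConstants.CDATA)) {
                    pending = parser.getText();
                    pos = 0;
                }
            }
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
        // Copy as much of the current chunk as the caller has room for.
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        try {
            parser.close();
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }
}
```

Because both sides are pull APIs, read() just asks the parser for the
next event whenever it runs out of text, and no tree or full copy of the
document is ever built. A Reader like this can then be handed to Lucene
as the value of a tokenized field.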

TX
