lucene-java-user mailing list archives

From Trejkaz <>
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Sat, 12 Mar 2011 09:59:58 GMT
On Fri, Mar 11, 2011 at 10:03 PM, shrinath.m <> wrote:
> I am trying to index content within certain HTML tags, how do I index it?
> Which is the best parser/tokenizer available to do this?

This doesn't really answer the question, but I think it will help...

The features you want to look for:
1. A StAX-like "pull parsing" API - this makes it easier to implement
Reader since Reader is also a pull API.
2. Doesn't try to store the entire HTML file in memory in any form -
this makes it not bomb on gigantic HTML files, which do occur in

A specific counterexample which fails to satisfy both of these rules
is HTMLParser, but be cautious of any library which doesn't satisfy
both.
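
As a rough sketch of point 1, here is how a pull parser can sit behind
a Reader. It uses the JDK's own StAX parser (javax.xml.stream), so it
only copes with well-formed XHTML; for real-world HTML you would need
to swap in a tolerant pull parser. The class name TagTextReader and the
helper fromString are made up for illustration, and I'm assuming you
want the text of a single element name only.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical Reader that exposes only the text inside elements named {@code tag}. */
class TagTextReader extends Reader {
    private final XMLStreamReader xml;
    private final String tag;     // element whose text we want to index
    private int depth = 0;        // > 0 while inside a matching element
    private String pending = "";  // text pulled from the parser but not yet returned
    private int pos = 0;          // read position within pending

    TagTextReader(Reader source, String tag) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Assumption: we do not want the parser to process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        this.xml = factory.createXMLStreamReader(source);
        this.tag = tag;
    }

    /** Convenience helper for small inputs and tests. */
    static TagTextReader fromString(String xhtml, String tag) throws XMLStreamException {
        return new TagTextReader(new StringReader(xhtml), tag);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        try {
            // Pull events only until there is some text to hand out.
            while (pos >= pending.length()) {
                if (!xml.hasNext()) {
                    return -1; // end of document
                }
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && xml.getLocalName().equals(tag)) {
                    depth++;
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && xml.getLocalName().equals(tag)) {
                    depth--;
                    pending = " "; // keep text of adjacent elements apart
                    pos = 0;
                } else if ((event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) && depth > 0) {
                    pending = xml.getText();
                    pos = 0;
                }
            }
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
        // Copy as much of the pending text as the caller has room for.
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        try {
            xml.close();
        } catch (XMLStreamException e) {
            throw new IOException(e);
        }
    }
}
```

Since this is just a Reader, it can be handed to anything that consumes
one. Its own state never holds more than the current text event, which
is the property point 2 is after.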

