lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ivan Krišto <ivan.kri...@gmail.com>
Subject Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Date Fri, 11 Mar 2011 11:29:47 GMT
Hello!

On Fri, Mar 11, 2011 at 12:03 PM, shrinath.m <shrinath.m@webyog.com> wrote:
> I am trying to index content withing certain HTML tags, how do I index it ?
> Which is the best parser/tokenizer available to do this ?

As a general HTML parser I would recommend "Jericho HTML Parser" -
http://jericho.htmlparser.net/docs/index.html
It is pretty robust (this is usually very important) and easy to use
(see provided examples). At the bottom of a given page you have the
short overview of few other HTML parsers.
I haven't compared it's speed with other parsers (I was only searching
for robust parser). But, this parser is neither an event nor tree
based parser (so, even automata theory can help us here).

If you need something pretty specific, like extracting all links from
page, I would recommend you to use simple regular expressions.


  Regards,
    Ivan Krišto

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message