lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Lukas Zapletal <l...@root.cz>
Subject Re: Best HTML Parser !!
Date Mon, 24 Feb 2003 19:28:50 GMT
Pierre Lacchini wrote:

>Hello,
> 
>i'm trying to index html file with Lucene.
>Do u know what's the best HTML Parser in Java ? 
>The most Powerful ?
>I need to extract meta-tag, and many other differents text fields...
> 
>Thx for ur help ;)
>
>  
>
I have some good experiences with JTidy. It works like DOM-XML parser 
and cleans HTML it by the way.
This is VERY useful, because EVERY HTML have at least ONE error.

Documents that was unparsable with Neko JTidy parsed without problems.

Creating indexing program was work for 2 hours.

-- 
Lukas Zapletal      [lzap@root.cz]
http://www.tanecni-olomouc.cz/lzap




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message