lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: which HTML parser is better?
Date Tue, 01 Feb 2005 09:54:33 GMT
Jingkang Zhang wrote:

>Three HTML parsers(Lucene web application
>demo,CyberNeko HTML Parser,JTidy) are mentioned in
>Lucene FAQ
>1.3.27.Which is the best?Can it filter tags that are
>auto-created by MS-word 'Save As HTML files' function?
>  
>

maybe you can try this library...

http://htmlparser.sourceforge.net/

I use the following code to get the text from HTML files,
it was not intensively tested, but it works.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

Parser parser = new Parser(source.getAbsolutePath());
NodeIterator iter = parser.elements();
while (iter.hasMoreNodes()) {
Node element = (Node) iter.nextNode();
//System.out.println("1:" + element.getText());
String text = Translate.decode(element.toPlainTextString());
if (Utils.notEmptyString(text))
writer.write(text);
}

Sergiu

>_________________________________________________________
>Do You Yahoo!?
>150万曲MP3疯狂搜,带您闯入音乐殿堂
>http://music.yisou.com/
>美女明星应有尽有,搜遍美图、艳图和酷图
>http://image.yisou.com
>1G就是1000兆,雅虎电邮自助扩容!
>http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message