lucene-java-user mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: which HTML parser is better?
Date Tue, 01 Feb 2005 17:17:52 GMT
I think that depends on what you want to do. The Lucene demo parser does a simple mapping of
HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko
is an extension of Xerces (it uses the same API and will likely become part of Xerces), so it
maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes.
I haven't used JTidy at an API level and so don't know it as well; based on its UI, it
appears to be focused primarily on HTML validation and error detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in
Lucene, and really like it.  It has been robust for me so far.
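
To illustrate what having a full DOM makes possible, here is a minimal sketch that walks an
org.w3c.dom tree and collects the text you might hand to Lucene, skipping script and style
content. It is an assumption-laden stand-in, not CyberNeko code: the class name DomTextSketch
and its methods are made up for this example, and the Document is built with the JDK's own XML
parser, which only accepts well-formed markup. With CyberNeko you would obtain the Document
from its parser instead and could reuse the same tree walk on real-world HTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
final class DomTextSketch {

    // Stand-in for an HTML parser: the JDK XML parser requires well-formed
    // markup, so this only works on XHTML-like input.
    static Document parse(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse markup", e);
        }
    }

    // Returns the text of the subtree with runs of whitespace collapsed.
    static String extractText(Node root) {
        StringBuilder out = new StringBuilder();
        collect(root, out);
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    private static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            // A space after each text node keeps words in adjacent elements
            // apart; it can also add a space before punctuation.
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        if (type == Node.ELEMENT_NODE) {
            String name = node.getNodeName().toLowerCase();
            if (name.equals("script") || name.equals("style")) {
                return; // not useful as indexed text
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Demo</title>"
            + "<style>p { color: red; }</style></head>"
            + "<body><p>Hello <b>DOM</b> world</p></body></html>";
        System.out.println(extractText(parse(page))); // prints: Demo Hello DOM world
    }
}
```

Because the walk only depends on org.w3c.dom.Node, the same approach could be extended to drop
or rewrite other elements before indexing, which the demo parser's direct mapping to Lucene
Documents does not allow.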

Chuck

  > -----Original Message-----
  > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
  > Sent: Tuesday, February 01, 2005 1:15 AM
  > To: lucene-user@jakarta.apache.org
  > Subject: which HTML parser is better?
  > 
  > Three HTML parsers (the Lucene web application
  > demo, the CyberNeko HTML Parser, and JTidy) are mentioned in
  > Lucene FAQ
  > 1.3.27. Which is the best? Can it filter the tags that are
  > auto-created by MS Word's 'Save As HTML' function?
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org



