lucene-java-user mailing list archives

From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 12:49:14 GMT
  Hi Karl,

 I already submitted a piece of code that removes the HTML tags.
 Search for my previous answer in this thread.

  Best,

   Sergiu
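
For readers who cannot locate the earlier post, here is a minimal sketch of
the kind of tag stripping asked about below. It is not the code referred to
above. The class name SimpleHtmlStripper and the method stripTags are
hypothetical, and the regex approach is an assumption chosen for brevity. It
is naive: it handles only a handful of entities, replaces every tag with a
space (so a word split by inline markup is split in the output), and can be
confused by a '>' inside an attribute value. For malformed or Word-generated
HTML, a real parser such as CyberNeko is likely the safer choice.

```java
import java.util.regex.Pattern;

public class SimpleHtmlStripper {
    // Comments are removed first so that markup inside them is not seen as tags.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    // script/style elements are dropped together with their bodies.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Anything else between '<' and '>' is treated as a tag.
    private static final Pattern TAGS = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        text = SCRIPT_STYLE.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode only the most common entities; &amp; goes last so that
        // "&amp;lt;" becomes "&lt;" rather than "<".
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
        // Collapse the runs of whitespace left behind by removed tags.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>"
                + "<p>Hello &amp; <b>world</b></p></body></html>";
        System.out.println(stripTags(html)); // prints: Title Hello & world
    }
}
```

The returned string can then be indexed as a single field, which matches the
"content as a whole" use case described in the quoted message.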

Karl Koch wrote:

>Hello,
>
>I have been following this thread and have another question.
>
>Is there a piece of source code, preferably very short and simple (KISS),
>that removes all HTML tags from HTML content? HTML 3.2 would be enough;
>there is no need to handle frames, CSS, etc.
>
>I do not need the HTML structure tree or any other structure; I just need
>a way to reduce HTML to its plain underlying text before indexing that
>content as a whole.
>
>Karl
>
>
>>I think that depends on what you want to do.  The Lucene demo parser does
>>simple mapping of HTML files into Lucene Documents; it does not give you a
>>parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
>>same API; will likely become part of Xerces), and so maps an HTML document
>>into a full DOM that you can manipulate easily for a wide range of
>>purposes.  I haven't used JTidy at an API level and so don't know it as
>>well -- based on its UI, it appears to be focused primarily on HTML
>>validation and error detection/correction.
>>
>>I use CyberNeko for a range of operations on HTML documents that go beyond
>>indexing them in Lucene, and really like it.  It has been robust for me so
>>far.
>>
>>Chuck
>>
>>  > -----Original Message-----
>>  > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
>>  > Sent: Tuesday, February 01, 2005 1:15 AM
>>  > To: lucene-user@jakarta.apache.org
>>  > Subject: which HTML parser is better?
>>  > 
>>  > Three HTML parsers (Lucene web application demo, CyberNeko HTML
>>  > Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
>>  > best? Can it filter tags that are auto-created by the MS Word
>>  > 'Save As HTML' function?
>>  > 
>>  > ---------------------------------------------------------------------
>>  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org



