lucene-java-user mailing list archives

From "Karl Koch" <TheRan...@gmx.net>
Subject RE: which HTML parser is better?
Date Wed, 02 Feb 2005 11:17:18 GMT
Hello,

I have been following this thread and have another question.

Is there a piece of source code (preferably very short and simple, KISS)
that removes all HTML tags from HTML content? HTML 3.2 would be enough,
with no need to handle frames, CSS, etc.

I do not need the HTML structure tree or any other structure; I just need
a facility to reduce HTML to its underlying text content before indexing
that content as a whole.

Karl
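
For the tag-stripping question above, one possible sketch uses the HTML
parser that ships with the JDK (javax.swing.text.html), which targets HTML
3.2 and needs no extra library. This is not something proposed in the
thread; the class name HtmlTextExtractor and the choice to join text chunks
with single spaces are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects only the text content of an HTML document.
final class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();
    private boolean inStyle = false;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // Remember when we are inside <style> so its content is not indexed.
        if (tag == HTML.Tag.STYLE) {
            inStyle = true;
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.STYLE) {
            inStyle = false;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inStyle) {
            return;
        }
        // Separate text chunks with a single space (an assumption; adjust as needed).
        if (text.length() > 0) {
            text.append(' ');
        }
        text.append(data);
    }

    // Parses the given HTML string and returns its text with all tags dropped.
    static String extract(String html) throws IOException {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return callback.text.toString();
    }
}
```

Calling HtmlTextExtractor.extract(html) should return a plain string that
can be added to a Lucene Document as an ordinary text field. How well this
parser copes with badly malformed or MS-Word-generated HTML is not covered
here and would need testing.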


> I think that depends on what you want to do.  The Lucene demo parser does
> simple mapping of HTML files into Lucene Documents; it does not give you a
> parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
> same API; will likely become part of Xerces), and so maps an HTML document
> into a full DOM that you can manipulate easily for a wide range of
> purposes.  I haven't used JTidy at an API level and so don't know it as
> well -- based on its UI, it appears to be focused primarily on HTML
> validation and error detection/correction.
> 
> I use CyberNeko for a range of operations on HTML documents that go beyond
> indexing them in Lucene, and really like it.  It has been robust for me so
> far.
> 
> Chuck
> 
>   > -----Original Message-----
>   > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
>   > Sent: Tuesday, February 01, 2005 1:15 AM
>   > To: lucene-user@jakarta.apache.org
>   > Subject: which HTML parser is better?
>   > 
>   > Three HTML parsers (Lucene web application demo, CyberNeko HTML
>   > Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
>   > best? Can it filter tags that are auto-created by the MS Word
>   > 'Save As HTML files' function?
>   > 
>   > ---------------------------------------------------------------------
>   > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>   > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 



