lucene-java-user mailing list archives

From "Karl Koch" <TheRan...@gmx.net>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 14:22:52 GMT
Hi,

yes, but the library you are using is quite big. I was thinking that about
5 kB of code could actually do that. That SourceForge project does much more
than that, but I do not need it.
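
A minimal sketch of what such a small tag stripper could look like, assuming
plain HTML 3.2-style input with no scripts or stylesheets. The names
HtmlTagStripper and strip are hypothetical; this is not the code posted
earlier in the thread and not part of Lucene, CyberNeko, or JTidy. It drops
anything between '<' and '>', drops comments, decodes a handful of entities,
and collapses whitespace. It does not build a parse tree, and a '>' inside an
attribute value would end the tag early.

```java
// Hypothetical helper for illustration only; not taken from any library in this thread.
final class HtmlTagStripper {

    // A few common entities. "&amp;" is decoded last so that "&amp;lt;" becomes "&lt;", not "<".
    private static final String[][] ENTITIES = {
        {"&lt;", "<"}, {"&gt;", ">"}, {"&quot;", "\""}, {"&nbsp;", " "}, {"&amp;", "&"}
    };

    /** Returns the text content of the given HTML with tags and comments removed. */
    static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        final int n = html.length();
        int i = 0;
        while (i < n) {
            char c = html.charAt(i);
            if (c != '<') {            // ordinary text: copy it through
                out.append(c);
                i++;
                continue;
            }
            int end;
            int skip;
            if (html.startsWith("<!--", i)) {
                end = html.indexOf("-->", i + 4);   // comment: skip to its terminator
                skip = 3;
            } else {
                end = html.indexOf('>', i + 1);     // tag: skip to the next '>'
                skip = 1;
            }
            if (end < 0) {             // unterminated '<': treat the rest as plain text
                out.append(html, i, n);
                break;
            }
            out.append(' ');           // keep words on either side of a tag apart
            i = end + skip;
        }
        String text = out.toString();
        for (String[] entity : ENTITIES) {
            text = text.replace(entity[0], entity[1]);
        }
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "Hello world & friends"
        System.out.println(strip("<p>Hello <b>world</b> &amp; friends</p>"));
    }
}
```

The returned string could then be passed to whatever indexing code is already
in place. Because every tag is replaced by a space, punctuation that directly
follows an inline tag ends up separated from the preceding word, which should
not matter for tokenized indexing.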

Karl

>   Hi Karl,
> 
>  I already submitted a piece of code that removes the HTML tags.
>  Search for my previous answer in this thread.
> 
>   Best,
> 
>    Sergiu
> 
> Karl Koch wrote:
> 
> >Hello,
> >
> >I have been following this thread and have another question.
> >
> >Is there a piece of source code (preferably very short and simple
> >(KISS)) that removes all HTML tags from HTML content? HTML 3.2
> >would be enough; no frames, CSS, etc. are needed.
> >
> >I do not need the HTML structure tree or any other structure, but
> >I need a facility to clean up HTML into its underlying plain content
> >before indexing that content as a whole.
> >
> >Karl
> >
> >>I think that depends on what you want to do.  The Lucene demo parser does
> >>simple mapping of HTML files into Lucene Documents; it does not give you a
> >>parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
> >>same API; will likely become part of Xerces), and so maps an HTML document
> >>into a full DOM that you can manipulate easily for a wide range of
> >>purposes.  I haven't used JTidy at an API level and so don't know it as
> >>well -- based on its UI, it appears to be focused primarily on HTML
> >>validation and error detection/correction.
> >>
> >>I use CyberNeko for a range of operations on HTML documents that go beyond
> >>indexing them in Lucene, and really like it.  It has been robust for me so
> >>far.
> >>
> >>Chuck
> >>
> >>  > -----Original Message-----
> >>  > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> >>  > Sent: Tuesday, February 01, 2005 1:15 AM
> >>  > To: lucene-user@jakarta.apache.org
> >>  > Subject: which HTML parser is better?
> >>  > 
> >>  > Three HTML parsers (the Lucene web application
> >>  > demo, CyberNeko HTML Parser, and JTidy) are mentioned in
> >>  > Lucene FAQ 1.3.27. Which is the best? Can it filter the tags that are
> >>  > auto-created by MS Word's 'Save As HTML' function?
> >>  > 
> >>  > ---------------------------------------------------------------------
> >>  > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>  > For additional commands, e-mail: lucene-user-help@jakarta.apache.org



