lucene-java-user mailing list archives

From "Karl Koch" <TheRan...@gmx.net>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 18:03:20 GMT
I control the HTML, so it is well-formed. I only use HTML files that I have
transformed from XML; there is no external HTML (e.g. from the web).

Is there a very short solution for that case?

Karl
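
For well-formed HTML like this, the regular-expression approach mentioned in the quoted reply below can be sketched in a few lines of Java. This is only an illustration, not the code posted earlier in the thread: the class name `TagStripper` and the method `stripTags` are hypothetical, and the sketch assumes the input contains no script or style blocks and only the handful of entities listed in the code.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Lucene or of any library named in this thread. */
final class TagStripper {
    // Comments are removed first so that markup inside them is not treated as content.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");
    // Assumes well-formed input: every '<' opens a tag that is closed by the next '>'.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String html) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last so nothing is decoded twice.
        text = text.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
                   .replace("&quot;", "\"").replace("&amp;", "&");
        // Collapse the whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Each tag is replaced by a space so that text from neighbouring block elements does not run together; the side effect is that inline markup inside a word splits that word. As noted in the quoted reply, nothing here is guaranteed to work on badly formatted HTML.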

> Karl Koch wrote:
> 
> >Hi,
> >
> >yes, but the library you are using is quite big. I was thinking that 5 kB
> >of code could actually do that. That SourceForge project does much more
> >than that, but I do not need it.
> >
> You need just htmlparser.jar (200k).
> ... you know ... the functionality is strongly correlated with the size.
> 
>   You can use 3 lines of code with a good regular expression to eliminate
> the HTML tags, but this won't give you any guarantee that the text from
> badly formatted HTML files will be correctly extracted...
> 
>   Best,
> 
>   Sergiu
> 
> >Karl
> >
> >  
> >
> >>  Hi Karl,
> >>
> >> I already submitted a piece of code that removes the HTML tags.
> >> Search for my previous answer in this thread.
> >>
> >>  Best,
> >>
> >>   Sergiu
> >>
> >>Karl Koch wrote:
> >>
> >>
> >>>Hello,
> >>>
> >>>I have been following this thread and have another question.
> >>>
> >>>Is there a piece of source code (preferably very short and simple
> >>>(KISS)) that removes all HTML tags from HTML content? HTML 3.2
> >>>would be enough... also no frames, CSS, etc.
> >>>
> >>>I do not need the HTML structure tree or any other structure, but I
> >>>need a facility to clean up HTML into its normal underlying content
> >>>before indexing that content as a whole.
> >>>
> >>>Karl
> >>>
> >>>>I think that depends on what you want to do.  The Lucene demo parser
> >>>>does simple mapping of HTML files into Lucene Documents; it does not
> >>>>give you a parse tree for the HTML doc.  CyberNeko is an extension of
> >>>>Xerces (uses the same API; will likely become part of Xerces), and so
> >>>>maps an HTML document into a full DOM that you can manipulate easily
> >>>>for a wide range of purposes.  I haven't used JTidy at an API level
> >>>>and so don't know it as well -- based on its UI, it appears to be
> >>>>focused primarily on HTML validation and error detection/correction.
> >>>>
> >>>>I use CyberNeko for a range of operations on HTML documents that go
> >>>>beyond indexing them in Lucene, and really like it.  It has been
> >>>>robust for me so far.
> >>>>
> >>>>Chuck
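
The DOM route described above can be illustrated by walking a parsed tree and collecting its text nodes. The sketch below does not use CyberNeko: as a stand-in it uses the XML parser that ships with the JDK, which only works under the assumption that the input is well-formed XML (for example XHTML generated from XML, with no DOCTYPE and no HTML-only entities such as `&nbsp;`). The class name `DomTextExtractor` and its methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper; uses the JDK XML parser as a stand-in for an HTML-to-DOM parser. */
final class DomTextExtractor {
    static String extractText(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            // Collapse runs of whitespace so the result is a single line of text.
            return out.toString().trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            // Assumption: callers only pass well-formed XML; anything else is reported here.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Depth-first walk that appends every text node, separated by a space.
    private static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }
}
```

Compared with stripping tags by regular expression, having the tree makes it straightforward to skip or single out particular elements during the walk, at the cost of requiring a parser that accepts the input.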
> >>>>
> >>>> > -----Original Message-----
> >>>> > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> >>>> > Sent: Tuesday, February 01, 2005 1:15 AM
> >>>> > To: lucene-user@jakarta.apache.org
> >>>> > Subject: which HTML parser is better?
> >>>> > 
> >>>> > Three HTML parsers (Lucene web application
> >>>> > demo, CyberNeko HTML Parser, JTidy) are mentioned in
> >>>> > Lucene FAQ
> >>>> > 1.3.27. Which is the best? Can it filter tags that are
> >>>> > auto-created by the MS Word 'Save As HTML' function?
> >>>> > 
> >>>> > ---------------------------------------------------------------------
> >>>> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>>> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org



