lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 19:40:03 GMT
If you are not married to Java:
http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm

Otis

--- sergiu gordea <gsergiu@ifit.uni-klu.ac.at> wrote:

> Karl Koch wrote:
> 
> >I am in control of the html, which means it is well formated HTML. I
> use
> >only HTML files which I have transformed from XML. No external HTML
> (e.g.
> >the web).
> >
> >Are there any very-short solutions for that?
> >  
> >
> if you are using only correct formated HTML pages and you are in
> control 
> of these pages.
> you can use a regular exprestion to remove the tags.
> 
> something like
> replaceAll("<*>","");
> 
> This is the ideea behind the operation. If you will search on google
> you 
> will find a more robust
> regular expression.
> 
> Using a simple regular expression will be a very cheap solution, that
> 
> can cause you a lot of problems in the future.
>  
>  It's up to you to use it ....
> 
>  Best,
>  
>  Sergiu
> 
> >Karl
> >
> >  
> >
> >>Karl Koch wrote:
> >>
> >>    
> >>
> >>>Hi,
> >>>
> >>>yes, but the library your are using is quite big. I was thinking
> that a
> >>>      
> >>>
> >>5kB
> >>    
> >>
> >>>code could actually do that. That sourceforge project is doing
> much more
> >>>than that but I do not need it.
> >>> 
> >>>
> >>>      
> >>>
> >>you need just the htmlparser.jar 200k.
> >>... you know ... the functionality is strongly correclated with the
> size.
> >>
> >>  You can use 3 lines of code with a good regular expresion to
> eliminate 
> >>the html tags,
> >>but this won't give you any guarantie that the text from the bad 
> >>fromated html files will be
> >>correctly extracted...
> >>
> >>  Best,
> >>
> >>  Sergiu
> >>
> >>    
> >>
> >>>Karl
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>> Hi Karl,
> >>>>
> >>>>I already submitted a peace of code that removes the html tags.
> >>>>Search for my previous answer in this thread.
> >>>>
> >>>> Best,
> >>>>
> >>>>  Sergiu
> >>>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>Hello,
> >>>>>
> >>>>>I have  been following this thread and have another question. 
> >>>>>
> >>>>>Is there a piece of sourcecode (which is preferably very short
> and
> >>>>>          
> >>>>>
> >>simple
> >>    
> >>
> >>>>>(KISS)) which allows to remove all HTML tags from HTML content?
> HTML
> >>>>>          
> >>>>>
> >>3.2
> >>    
> >>
> >>>>>would be enough...also no frames, CSS, etc. 
> >>>>>
> >>>>>I do not need to have the HTML strucutre tree or any other
> structure
> >>>>>          
> >>>>>
> >>but
> >>    
> >>
> >>>>>need a facility to clean up HTML into its normal underlying
> content
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>before
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>indexing that content as a whole.
> >>>>>
> >>>>>Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>I think that depends on what you want to do.  The Lucene demo
> parser
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>does
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>simple mapping of HTML files into Lucene Documents; it does not
> give
> >>>>>>            
> >>>>>>
> >>you
> >>    
> >>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>a
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>parse tree for the HTML doc.  CyberNeko is an extension of
> Xerces
> >>>>>>            
> >>>>>>
> >>(uses
> >>    
> >>
> >>>>>>  
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>the
> >>>>>
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>same API; will likely become part of Xerces), and so maps an
> HTML
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>document
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>into a full DOM that you can manipulate easily for a wide range
> of
> >>>>>>purposes.  I haven't used JTidy at an API level and so don't
> know it
> >>>>>>            
> >>>>>>
> >>as
> >>    
> >>
> >>>>>>  
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>well --
> >>>>>
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>based on its UI, it appears to be focused primarily on HTML
> validation
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>and
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>error detection/correction.
> >>>>>>
> >>>>>>I use CyberNeko for a range of operations on HTML documents
> that go
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>beyond
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>indexing them in Lucene, and really like it.  It has been
> robust for
> >>>>>>            
> >>>>>>
> >>me
> >>    
> >>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>so
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>far.
> >>>>>>
> >>>>>>Chuck
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>>>-----Original Message-----
> >>>>>>>From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> >>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
> >>>>>>>To: lucene-user@jakarta.apache.org
> >>>>>>>Subject: which HTML parser is better?
> >>>>>>>
> >>>>>>>Three HTML parsers(Lucene web application
> >>>>>>>demo,CyberNeko HTML Parser,JTidy) are mentioned in
> >>>>>>>Lucene FAQ
> >>>>>>>1.3.27.Which is the best?Can it filter tags that are
> >>>>>>>auto-created by MS-word 'Save As HTML files' function?
> >>>>>>>
> >>>>>>>_________________________________________________________
> >>>>>>>Do You Yahoo!?
> >>>>>>>150万曲MP3疯狂搜,带您闯入音乐殿堂
> >>>>>>>http://music.yisou.com/
> >>>>>>>美女明星应有尽有,搜遍美图、艳图和酷图
> >>>>>>>http://image.yisou.com
> >>>>>>>1G就是1000兆,雅虎电邮自助扩容!
> >>>>>>>
> >>>>>>>              
> >>>>>>>
>
>>>>>http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/ma
> >>>>>          
> >>>>>
> >>>>>>>il_1g/
> >>>>>>>
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
>
>>>>---------------------------------------------------------------------
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>>To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> >>>>>>>For additional commands, e-mail:
> >>>>>>>              
> >>>>>>>
> >>lucene-user-help@jakarta.apache.org
> >>    
> >>
>
>>>>>>---------------------------------------------------------------------
> >>>>>>To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> >>>>>>For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> >>>>>>
> >>>>>>  
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
>
>>>>---------------------------------------------------------------------
> >>>>To unsubscribe, e-mail:
> lucene-user-unsubscribe@jakarta.apache.org
> >>>>For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>> 
> >>>
> >>>      
> >>>
>
>>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail:
> lucene-user-help@jakarta.apache.org
> >>
> >>    
> >>
> >
> >  
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message