lucene-java-user mailing list archives

From "Karl Koch" <TheRan...@gmx.net>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 09:58:52 GMT
Unfortunately I am faithful ;-). For practical reasons I want to do that
in a single class, or even a single method, called by another part of my
Java application. It should also run on Java 1.1 and it should be small and
simple. As I said before, I am in control of the HTML and it will be well
formatted, because I generate it from XML using XSLT.

Karl
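
A minimal sketch of the kind of single-method tag stripper described above,
assuming the input is well-formed HTML generated from XML. The names
TagStripper and stripTags are hypothetical. It relies only on String and
StringBuffer, so it should not need anything newer than Java 1.1. It does
not decode character entities such as &amp;, and it assumes that no '>'
appears inside attribute values or comments.

```java
public class TagStripper {

    /**
     * Drops everything from a '<' up to the next '>' and keeps the rest.
     * Each removed tag is replaced by one space so that words on either
     * side of a tag (for example around <br>) do not run together.
     */
    public static String stripTags(String html) {
        StringBuffer out = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start skipping
            } else if (c == '>' && inTag) {
                inTag = false;           // tag finished
                out.append(' ');
            } else if (!inTag) {
                out.append(c);           // ordinary text
            }
        }
        return out.toString();
    }
}
```

The extra spaces left where tags used to be would normally be ignored by a
tokenizing analyzer when the text is indexed.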

> If you are not married to Java:
> http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
> 
> Otis
> 
> --- sergiu gordea <gsergiu@ifit.uni-klu.ac.at> wrote:
> 
> > Karl Koch wrote:
> > 
> > >I am in control of the html, which means it is well formatted HTML. I
> > >use only HTML files which I have transformed from XML. No external HTML
> > >(e.g. the web).
> > >
> > >Are there any very short solutions for that?
> > >
> > if you are using only correctly formatted HTML pages and you are in
> > control of these pages,
> > you can use a regular expression to remove the tags.
> > 
> > something like
> > replaceAll("<[^>]*>","");
> > 
> > This is the idea behind the operation. If you search on Google you
> > will find a more robust regular expression.
> > 
> > Using a simple regular expression will be a very cheap solution, but one
> > that can cause you a lot of problems in the future.
> >  
> >  It's up to you to use it ....
> > 
> >  Best,
> >  
> >  Sergiu
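
The regular-expression idea described above could be sketched as follows.
This is an illustration, not code from the thread: the names
RegexTagStripper and stripTags are hypothetical, and the pattern assumes
well-formed input with no '>' inside attribute values or comments. Note that
java.util.regex was only added in Java 1.4, so this variant would not run on
Java 1.1.

```java
import java.util.regex.Pattern;

public class RegexTagStripper {

    // '<', then any run of characters other than '>', then the closing '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Replaces every match of the tag pattern with a single space. */
    public static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the expression
for every document, which String.replaceAll would otherwise do on each call.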
> > 
> > >Karl
> > >
> > >  
> > >
> > >>Karl Koch wrote:
> > >>
> > >>>Hi,
> > >>>
> > >>>yes, but the library you are using is quite big. I was thinking that
> > >>>a 5kB code could actually do that. That sourceforge project is doing
> > >>>much more than that but I do not need it.
> > >>>
> > >>you need just the htmlparser.jar (200k).
> > >>... you know ... the functionality is strongly correlated with the size.
> > >>
> > >>  You can use 3 lines of code with a good regular expression to
> > >>eliminate the html tags,
> > >>but this won't give you any guarantee that the text from badly
> > >>formatted html files will be correctly extracted...
> > >>
> > >>  Best,
> > >>
> > >>  Sergiu
> > >>
> > >>>Karl
> > >>>
> > >>>> Hi Karl,
> > >>>>
> > >>>>I already submitted a piece of code that removes the html tags.
> > >>>>Search for my previous answer in this thread.
> > >>>>
> > >>>> Best,
> > >>>>
> > >>>>  Sergiu
> > >>>>
> > >>>>Karl Koch wrote:
> > >>>>
> > >>>>>Hello,
> > >>>>>
> > >>>>>I have been following this thread and have another question.
> > >>>>>
> > >>>>>Is there a piece of source code (preferably very short and simple
> > >>>>>(KISS)) that removes all HTML tags from HTML content? HTML 3.2
> > >>>>>would be enough...also no frames, CSS, etc.
> > >>>>>
> > >>>>>I do not need to have the HTML structure tree or any other structure
> > >>>>>but need a facility to clean up HTML into its normal underlying
> > >>>>>content before indexing that content as a whole.
> > >>>>>
> > >>>>>Karl
> > >>>>>
> > >>>>>>I think that depends on what you want to do.  The Lucene demo parser
> > >>>>>>does simple mapping of HTML files into Lucene Documents; it does not
> > >>>>>>give you a parse tree for the HTML doc.  CyberNeko is an extension of
> > >>>>>>Xerces (uses the same API; will likely become part of Xerces), and so
> > >>>>>>maps an HTML document into a full DOM that you can manipulate easily
> > >>>>>>for a wide range of purposes.  I haven't used JTidy at an API level
> > >>>>>>and so don't know it as well; based on its UI, it appears to be
> > >>>>>>focused primarily on HTML validation and error detection/correction.
> > >>>>>>
> > >>>>>>I use CyberNeko for a range of operations on HTML documents that go
> > >>>>>>beyond indexing them in Lucene, and really like it.  It has been
> > >>>>>>robust for me so far.
> > >>>>>>
> > >>>>>>Chuck
> > >>>>>>
> > >>>>>>>-----Original Message-----
> > >>>>>>>From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> > >>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
> > >>>>>>>To: lucene-user@jakarta.apache.org
> > >>>>>>>Subject: which HTML parser is better?
> > >>>>>>>
> > >>>>>>>Three HTML parsers (Lucene web application demo, CyberNeko HTML
> > >>>>>>>Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
> > >>>>>>>best? Can it filter tags that are auto-created by the MS Word
> > >>>>>>>'Save As HTML files' function?
> > >>>>>>>
> > >>>>>>>---------------------------------------------------------------------
> > >>>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > >>>>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

