lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Karl Koch" <TheRan...@gmx.net>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 09:54:46 GMT
Hello Sergiu,

thank you for your help so far. I appreciate it.

I am working with Java 1.1 which does not include regular expressions.

Your turn ;-)
Karl 

> Karl Koch wrote:
> 
> >I am in control of the html, which means it is well formated HTML. I use
> >only HTML files which I have transformed from XML. No external HTML (e.g.
> >the web).
> >
> >Are there any very-short solutions for that?
> >  
> >
> if you are using only correct formated HTML pages and you are in control 
> of these pages.
> you can use a regular exprestion to remove the tags.
> 
> something like
> replaceAll("<*>","");
> 
> This is the ideea behind the operation. If you will search on google you 
> will find a more robust
> regular expression.
> 
> Using a simple regular expression will be a very cheap solution, that 
> can cause you a lot of problems in the future.
>  
>  It's up to you to use it ....
> 
>  Best,
>  
>  Sergiu
> 
> >Karl
> >
> >  
> >
> >>Karl Koch wrote:
> >>
> >>    
> >>
> >>>Hi,
> >>>
> >>>yes, but the library your are using is quite big. I was thinking that a
> >>>      
> >>>
> >>5kB
> >>    
> >>
> >>>code could actually do that. That sourceforge project is doing much
> more
> >>>than that but I do not need it.
> >>> 
> >>>
> >>>      
> >>>
> >>you need just the htmlparser.jar 200k.
> >>... you know ... the functionality is strongly correclated with the
> size.
> >>
> >>  You can use 3 lines of code with a good regular expresion to eliminate
> >>the html tags,
> >>but this won't give you any guarantie that the text from the bad 
> >>fromated html files will be
> >>correctly extracted...
> >>
> >>  Best,
> >>
> >>  Sergiu
> >>
> >>    
> >>
> >>>Karl
> >>>
> >>> 
> >>>
> >>>      
> >>>
> >>>> Hi Karl,
> >>>>
> >>>>I already submitted a peace of code that removes the html tags.
> >>>>Search for my previous answer in this thread.
> >>>>
> >>>> Best,
> >>>>
> >>>>  Sergiu
> >>>>
> >>>>Karl Koch wrote:
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>Hello,
> >>>>>
> >>>>>I have  been following this thread and have another question. 
> >>>>>
> >>>>>Is there a piece of sourcecode (which is preferably very short and
> >>>>>          
> >>>>>
> >>simple
> >>    
> >>
> >>>>>(KISS)) which allows to remove all HTML tags from HTML content? HTML
> >>>>>          
> >>>>>
> >>3.2
> >>    
> >>
> >>>>>would be enough...also no frames, CSS, etc. 
> >>>>>
> >>>>>I do not need to have the HTML strucutre tree or any other structure
> >>>>>          
> >>>>>
> >>but
> >>    
> >>
> >>>>>need a facility to clean up HTML into its normal underlying content
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>before
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>indexing that content as a whole.
> >>>>>
> >>>>>Karl
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>I think that depends on what you want to do.  The Lucene demo
parser
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>does
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>simple mapping of HTML files into Lucene Documents; it does not
give
> >>>>>>            
> >>>>>>
> >>you
> >>    
> >>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>a
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>parse tree for the HTML doc.  CyberNeko is an extension of Xerces
> >>>>>>            
> >>>>>>
> >>(uses
> >>    
> >>
> >>>>>>  
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>the
> >>>>>
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>same API; will likely become part of Xerces), and so maps an
HTML
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>document
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>into a full DOM that you can manipulate easily for a wide range
of
> >>>>>>purposes.  I haven't used JTidy at an API level and so don't
know it
> >>>>>>            
> >>>>>>
> >>as
> >>    
> >>
> >>>>>>  
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>well --
> >>>>>
> >>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>based on its UI, it appears to be focused primarily on HTML
> validation
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>and
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>error detection/correction.
> >>>>>>
> >>>>>>I use CyberNeko for a range of operations on HTML documents that
go
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>beyond
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>indexing them in Lucene, and really like it.  It has been robust
for
> >>>>>>            
> >>>>>>
> >>me
> >>    
> >>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>so
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>far.
> >>>>>>
> >>>>>>Chuck
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>>>-----Original Message-----
> >>>>>>>From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> >>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
> >>>>>>>To: lucene-user@jakarta.apache.org
> >>>>>>>Subject: which HTML parser is better?
> >>>>>>>
> >>>>>>>Three HTML parsers(Lucene web application
> >>>>>>>demo,CyberNeko HTML Parser,JTidy) are mentioned in
> >>>>>>>Lucene FAQ
> >>>>>>>1.3.27.Which is the best?Can it filter tags that are
> >>>>>>>auto-created by MS-word 'Save As HTML files' function?
> >>>>>>>
> >>>>>>>_________________________________________________________
> >>>>>>>Do You Yahoo!?
> >>>>>>>150ÍòÇúMP3·è¿ñËÑ£¬´øÄú´³ÈëÒôÀÖµîÌÃ
> >>>>>>>http://music.yisou.com/
> >>>>>>>ÃÀÅ®Ã÷ÐÇÓ¦Óо¡ÓУ¬ËѱéÃÀͼ¡¢ÑÞͼºÍ¿áͼ
> >>>>>>>http://image.yisou.com
> >>>>>>>1G¾ÍÊÇ1000Õ×£¬ÑÅ»¢µçÓÊ×ÔÖúÀ©ÈÝ£¡
> >>>>>>>
> >>>>>>>              
> >>>>>>>
>
>>>>>http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/ma
> >>>>>          
> >>>>>
> >>>>>>>il_1g/
> >>>>>>>
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>---------------------------------------------------------------------
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>>>>>>For additional commands, e-mail:
> >>>>>>>              
> >>>>>>>
> >>lucene-user-help@jakarta.apache.org
> >>    
> >>
>
>>>>>>---------------------------------------------------------------------
> >>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>>>>>
> >>>>>>  
> >>>>>>
> >>>>>>       
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>     
> >>>>>
> >>>>>          
> >>>>>
> >>>>---------------------------------------------------------------------
> >>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>>>
> >>>>   
> >>>>
> >>>>        
> >>>>
> >>> 
> >>>
> >>>      
> >>>
> >>---------------------------------------------------------------------
> >>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> >>
> >>    
> >>
> >
> >  
> >
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - die erste Adresse für Mail, Message, More +++

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message