lucene-java-user mailing list archives

From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 10:07:20 GMT
Karl Koch wrote:

>Unfortunately I am faithful ;-). Just for practical reasons I want to do that
>in a single class or even method called by another part of my Java
>application. It should also run on Java 1.1 and it should be small and
>simple. As I said before, I am in control of the HTML and it will be well
>formatted, because I generate it from XML using XSLT.
>
Why don't you get the data directly from the XML files?
You can use a SAX parser, ... but I think it will require Java 1.3 or at
least 1.2.2.
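
A minimal sketch of the SAX route, assuming a JVM that ships JAXP (the
javax.xml.parsers classes used here are not part of Java 1.1, so this does
not meet that constraint as written). XmlTextExtractor and extract are
names made up for this illustration; the handler simply concatenates all
character data and ignores the element structure.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not from the thread): collects the character data
// of an XML document with SAX and discards all markup.
class XmlTextExtractor extends DefaultHandler {
    private final StringBuffer text = new StringBuffer();

    // SAX may deliver one text node in several chunks, so always append.
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Parses the given XML string and returns its text content.
    static String extract(String xml) {
        XmlTextExtractor handler = new XmlTextExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Malformed XML or I/O trouble: surface it to the caller unchecked.
            throw new RuntimeException(e);
        }
        return handler.text.toString();
    }
}
```

For example, XmlTextExtractor.extract("<p>a &amp; b</p>") yields "a & b":
the parser resolves entities, which plain tag stripping would not do.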

 Best,

  Sergiu

>Karl
>
>>If you are not married to Java:
>>http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
>>
>>Otis
>>
>>--- sergiu gordea <gsergiu@ifit.uni-klu.ac.at> wrote:
>>
>>>Karl Koch wrote:
>>>
>>>>I am in control of the HTML, which means it is well-formatted HTML. I
>>>>use only HTML files which I have transformed from XML. No external HTML
>>>>(e.g. the web).
>>>>
>>>>Are there any very-short solutions for that?
>>>>
>>>If you are using only correctly formatted HTML pages and you are in
>>>control of these pages, you can use a regular expression to remove
>>>the tags.
>>>
>>>Something like
>>>replaceAll("<[^>]*>","");
>>>
>>>This is the idea behind the operation. If you search on Google you
>>>will find a more robust regular expression.
>>>
>>>Using a simple regular expression is a very cheap solution that can
>>>cause you a lot of problems in the future.
>>>
>>>It's up to you to use it ....
>>>
>>> Best,
>>> 
>>> Sergiu
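
The regular-expression idea quoted above can be sketched as follows. This
is only an illustration under stated assumptions: RegexTagStripper and
strip are made-up names, and the pattern "<[^>]*>" is one common choice,
not necessarily the one Sergiu had in mind. Note that String.replaceAll
was added in Java 1.4, so it is not available on Java 1.1.

```java
// Hypothetical helper (not from the thread): strips tags from well-formed,
// self-generated HTML with a single regular expression.
class RegexTagStripper {
    static String strip(String html) {
        // "<[^>]*>" matches '<', any run of characters other than '>', then '>'.
        return html.replaceAll("<[^>]*>", "");
    }
}
```

RegexTagStripper.strip("<p>Hello <b>world</b></p>") returns "Hello world".
Entities such as &amp; are left untouched, and a '>' inside an attribute
value or a comment would end the match too early, which is the kind of
problem the warning above refers to.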
>>>
>>>>Karl
>>>>
>>>>>Karl Koch wrote:
>>>>>
>>>>>>Hi,
>>>>>>
>>>>>>yes, but the library you are using is quite big. I was thinking that a
>>>>>>5kB code could actually do that. That sourceforge project is doing much
>>>>>>more than that but I do not need it.
>>>>>>
>>>>>You need just the htmlparser.jar (200k).
>>>>>... you know ... the functionality is strongly correlated with the size.
>>>>>
>>>>>You can use 3 lines of code with a good regular expression to eliminate
>>>>>the HTML tags, but this won't give you any guarantee that the text from
>>>>>badly formatted HTML files will be correctly extracted...
>>>>>
>>>>> Best,
>>>>>
>>>>> Sergiu
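
To illustrate the size argument: a regex-free stripper fits in a few lines
and uses only classes that already existed in Java 1.1 (String and
StringBuffer). This is a sketch, not code from the thread; the name
SimpleTagStripper and the choice to emit a space for every removed tag
(so that words in adjacent elements stay separate for indexing) are
assumptions.

```java
// Hypothetical helper (not from the thread): removes everything between
// '<' and '>' using a single pass and a one-bit state machine.
class SimpleTagStripper {
    static String strip(String html) {
        StringBuffer out = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;            // start skipping at the opening bracket
            } else if (c == '>' && inTag) {
                inTag = false;           // resume copying after the closing bracket
                out.append(' ');         // keep words of adjacent elements apart
            } else if (!inTag) {
                out.append(c);           // ordinary text is copied unchanged
            }
        }
        return out.toString();
    }
}
```

On well-formed input this behaves as expected, but it also shows the
caveat about badly formatted HTML: SimpleTagStripper.strip("1 < 2")
returns "1 " because the unescaped '<' is treated as the start of a tag
that never closes.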
>>>>>
>>>>>>Karl
>>>>>>
>>>>>>>Hi Karl,
>>>>>>>
>>>>>>>I already submitted a piece of code that removes the HTML tags.
>>>>>>>Search for my previous answer in this thread.
>>>>>>>
>>>>>>>Best,
>>>>>>>
>>>>>>> Sergiu
>>>>>>>
>>>>>>>Karl Koch wrote:
>>>>>>>
>>>>>>>>Hello,
>>>>>>>>
>>>>>>>>I have been following this thread and have another question.
>>>>>>>>
>>>>>>>>Is there a piece of source code (preferably very short and simple
>>>>>>>>(KISS)) that can remove all HTML tags from HTML content? HTML 3.2
>>>>>>>>would be enough...also no frames, CSS, etc.
>>>>>>>>
>>>>>>>>I do not need to have the HTML structure tree or any other structure but
>>>>>>>>need a facility to clean up HTML into its normal underlying content
>>>>>>>>before indexing that content as a whole.
>>>>>>>>
>>>>>>>>Karl
>>>>>>>>
>>>>>>>>>I think that depends on what you want to do.  The Lucene demo parser
>>>>>>>>>does simple mapping of HTML files into Lucene Documents; it does not
>>>>>>>>>give you a parse tree for the HTML doc.  CyberNeko is an extension of
>>>>>>>>>Xerces (uses the same API; will likely become part of Xerces), and so
>>>>>>>>>maps an HTML document into a full DOM that you can manipulate easily
>>>>>>>>>for a wide range of purposes.  I haven't used JTidy at an API level and
>>>>>>>>>so don't know it as well -- based on its UI, it appears to be focused
>>>>>>>>>primarily on HTML validation and error detection/correction.
>>>>>>>>>
>>>>>>>>>I use CyberNeko for a range of operations on HTML documents that go
>>>>>>>>>beyond indexing them in Lucene, and really like it.  It has been robust
>>>>>>>>>for me so far.
>>>>>>>>>
>>>>>>>>>Chuck
>>>>>>>>>
>>>>>>>>>>-----Original Message-----
>>>>>>>>>>From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
>>>>>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
>>>>>>>>>>To: lucene-user@jakarta.apache.org
>>>>>>>>>>Subject: which HTML parser is better?
>>>>>>>>>>
>>>>>>>>>>Three HTML parsers (Lucene web application demo, CyberNeko HTML
>>>>>>>>>>Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
>>>>>>>>>>best? Can it filter tags that are auto-created by the MS Word
>>>>>>>>>>'Save As HTML files' function?
>>>>>>>>>>
>>>>>>>>>>---------------------------------------------------------------------
>>>>>>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>>>>>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

