lucene-java-user mailing list archives

From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 10:04:22 GMT
Karl Koch wrote:

>Hello Sergiu,
>
>thank you for your help so far. I appreciate it.
>
>I am working with Java 1.1 which does not include regular expressions.
>  
>
Why are you using Java 1.1? Are you that limited in resources?
What operating system do you use?
I assume that you just need to index the HTML files, so you need an
html2txt conversion.
If an external converter is a solution for you, you can use
Runtime.getRuntime().exec(...) to run a converter that extracts the
text from your HTML files
and generates a .txt file. Then you can use a Reader to index the .txt file.
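
A rough sketch of the external-converter route, in case it helps. The converter name "html2txt" and its argument order (input file, then output file) are placeholders I made up for illustration, not a real tool's interface; adapt them to whatever converter you install. Only Runtime.exec, Process.waitFor and FileReader are used, all of which exist in Java 1.1.

```java
// Sketch only: the converter name and its argument order are assumptions.
class ExternalHtmlToText {

    /** Builds the command line: converter, input file, output file. */
    static String[] buildCommand(String converter, String htmlPath, String txtPath) {
        return new String[] { converter, htmlPath, txtPath };
    }

    /** Runs the converter, waits for it to finish, and returns its exit code. */
    static int convert(String converter, String htmlPath, String txtPath)
            throws java.io.IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(buildCommand(converter, htmlPath, txtPath));
        // Note: a converter that prints a lot to stdout/stderr may block
        // unless those streams are read; that handling is omitted here.
        return p.waitFor();
    }

    /** Opens the generated text file; the Reader can be handed to the indexer. */
    static java.io.Reader openText(String txtPath) throws java.io.IOException {
        return new java.io.BufferedReader(new java.io.FileReader(txtPath));
    }
}
```

You would call convert("html2txt", "page.html", "page.txt"), check that the exit code signals success (usually 0), and then pass openText("page.txt") to your existing indexing code.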

As I told you before, the best solution depends on your constraints 
(time, effort, hardware, performance) and requirements :)
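
Since Java 1.1 has no regular expressions, the tag removal discussed earlier in the thread can also be done with a small hand-written scanner. This is only a sketch under the assumption you stated: well-formed HTML that you control, where '<' and '>' appear only as tag delimiters. It does not decode entities such as &amp; and it does not handle comments or scripts. The class and method names are mine.

```java
// Sketch only: assumes '<' and '>' occur only as tag delimiters
// (no '>' inside attribute values, no comments, no scripts).
// Uses only classes available in Java 1.1 (no java.util.regex).
class HtmlTagStripper {

    /** Returns the input with every <...> span replaced by a single space. */
    static String strip(String html) {
        StringBuffer text = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;              // start skipping tag content
            } else if (c == '>' && inTag) {
                inTag = false;
                text.append(' ');          // keep words on either side of a tag apart
            } else if (!inTag) {
                text.append(c);            // ordinary text is copied through
            }
        }
        return text.toString();
    }
}
```

The space appended for each tag keeps "one</p><p>two" from being indexed as a single token; the extra whitespace should not matter to a tokenizer that splits on whitespace.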

  Best,

  Sergiu

>Your turn ;-)
>Karl 
>
>  
>
>>Karl Koch wrote:
>>
>>    
>>
>>>I am in control of the HTML, which means it is well-formatted HTML. I use
>>>only HTML files which I have transformed from XML. No external HTML (e.g.
>>>the web).
>>>
>>>Are there any very short solutions for that?
>>>
>>If you are using only correctly formatted HTML pages and you are in control
>>of these pages,
>>you can use a regular expression to remove the tags.
>>
>>Something like
>>replaceAll("<[^>]*>", "");
>>
>>This is the idea behind the operation. If you search on Google you
>>will find a more robust
>>regular expression.
>>
>>Using a simple regular expression is a very cheap solution that
>>can cause you a lot of problems in the future.
>>
>>It's up to you whether to use it ....
>>
>> Best,
>> 
>> Sergiu
>>
>>    
>>
>>>Karl
>>>
>>> 
>>>
>>>      
>>>
>>>>Karl Koch wrote:
>>>>
>>>>>Hi,
>>>>>
>>>>>yes, but the library you are using is quite big. I was thinking that a
>>>>>5kB code could actually do that. That sourceforge project is doing much
>>>>>more than that but I do not need it.
>>>>>
>>>>You need just the htmlparser.jar, 200k.
>>>>... you know ... the functionality is strongly correlated with the size.
>>>>
>>>>You can use 3 lines of code with a good regular expression to eliminate
>>>>the HTML tags,
>>>>but this won't give you any guarantee that the text from badly
>>>>formatted HTML files will be
>>>>correctly extracted...
>>>>
>>>> Best,
>>>>
>>>> Sergiu
>>>>
>>>>   
>>>>
>>>>        
>>>>
>>>>>Karl
>>>>>
>>>>>
>>>>>
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>>>Hi Karl,
>>>>>>
>>>>>>I already submitted a piece of code that removes the HTML tags.
>>>>>>Search for my previous answer in this thread.
>>>>>>
>>>>>>Best,
>>>>>>
>>>>>> Sergiu
>>>>>>
>>>>>>Karl Koch wrote:
>>>>>>
>>>>>>>Hello,
>>>>>>>
>>>>>>>I have been following this thread and have another question.
>>>>>>>
>>>>>>>Is there a piece of source code (preferably very short and simple
>>>>>>>(KISS)) which allows removing all HTML tags from HTML content? HTML 3.2
>>>>>>>would be enough... also no frames, CSS, etc.
>>>>>>>
>>>>>>>I do not need to have the HTML structure tree or any other structure, but
>>>>>>>need a facility to clean up HTML into its normal underlying content before
>>>>>>>indexing that content as a whole.
>>>>>>>
>>>>>>>Karl
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>    
>>>>>>>
>>>>>>>         
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>>>>I think that depends on what you want to do.  The Lucene demo parser does
>>>>>>>>simple mapping of HTML files into Lucene Documents; it does not give you a
>>>>>>>>parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
>>>>>>>>same API; will likely become part of Xerces), and so maps an HTML document
>>>>>>>>into a full DOM that you can manipulate easily for a wide range of
>>>>>>>>purposes.  I haven't used JTidy at an API level and so don't know it as
>>>>>>>>well -- based on its UI, it appears to be focused primarily on HTML
>>>>>>>>validation and error detection/correction.
>>>>>>>>
>>>>>>>>I use CyberNeko for a range of operations on HTML documents that go beyond
>>>>>>>>indexing them in Lucene, and really like it.  It has been robust for me so
>>>>>>>>far.
>>>>>>>>
>>>>>>>>Chuck
>>>>>>>>
>>>>>>>>           
>>>>>>>>
>>>>>>>>                
>>>>>>>>
>>>>>>>>>-----Original Message-----
>>>>>>>>>From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
>>>>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM
>>>>>>>>>To: lucene-user@jakarta.apache.org
>>>>>>>>>Subject: which HTML parser is better?
>>>>>>>>>
>>>>>>>>>Three HTML parsers (Lucene web application
>>>>>>>>>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>>>>>>>>>Lucene FAQ
>>>>>>>>>1.3.27. Which is the best? Can it filter tags that are
>>>>>>>>>auto-created by the MS Word 'Save As HTML' function?
>>>>>>>>>
>>>>>>>>>---------------------------------------------------------------------
>>>>>>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>>>>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

