lucene-java-user mailing list archives

From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 14:28:47 GMT
Karl Koch wrote:

>Hi,
>
>yes, but the library you are using is quite big. I was thinking that 5 kB of
>code could actually do that. That SourceForge project does much more
>than that, but I do not need it.
>  
>
You need just htmlparser.jar (200k).
... you know ... the functionality is strongly correlated with the size.

  You can use 3 lines of code with a good regular expression to eliminate
the HTML tags, but this won't give you any guarantee that the text from
badly formatted HTML files will be correctly extracted...
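
As a rough illustration of the regular-expression approach (a minimal sketch,
not the code posted earlier in this thread; the class name TagStripper and the
method name stripTags are made up for this example), something like the
following replaces anything that looks like a tag with a space and then
collapses the whitespace:

```java
import java.util.regex.Pattern;

public class TagStripper {
    // Hypothetical helper for illustration only.
    // Matches '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Matches one or more whitespace characters.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        // Replace each tag with a space so adjacent words do not run together.
        String noTags = TAG.matcher(html).replaceAll(" ");
        // Collapse the extra whitespace left behind and trim the ends.
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

This does not decode entities such as &amp;, and it leaves the contents of
script and style elements in the output. It also goes wrong as soon as a '>'
appears inside an attribute value, for example <a title="a > b">link</a>,
because the match ends at the first '>'. That is the kind of input where a
real parser is the safer choice.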

  Best,

  Sergiu

>Karl
>
>  
>
>>  Hi Karl,
>>
>> I already submitted a piece of code that removes the HTML tags.
>> Search for my previous answer in this thread.
>>
>>  Best,
>>
>>   Sergiu
>>
>>Karl Koch wrote:
>>
>>    
>>
>>>Hello,
>>>
>>>I have been following this thread and have another question.
>>>
>>>Is there a piece of source code (preferably very short and simple
>>>(KISS)) which can remove all HTML tags from HTML content? HTML 3.2
>>>would be enough... also no frames, CSS, etc.
>>>
>>>I do not need the HTML structure tree or any other structure, but I
>>>need a facility to clean up HTML into its normal underlying content before
>>>indexing that content as a whole.
>>>
>>>Karl
>>>
>>>
>>> 
>>>
>>>      
>>>
>>>>I think that depends on what you want to do.  The Lucene demo parser does
>>>>simple mapping of HTML files into Lucene Documents; it does not give you a
>>>>parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
>>>>same API; will likely become part of Xerces), and so maps an HTML document
>>>>into a full DOM that you can manipulate easily for a wide range of
>>>>purposes.  I haven't used JTidy at an API level and so don't know it as
>>>>well -- based on its UI, it appears to be focused primarily on HTML
>>>>validation and error detection/correction.
>>>>
>>>>I use CyberNeko for a range of operations on HTML documents that go beyond
>>>>indexing them in Lucene, and really like it.  It has been robust for me so
>>>>far.
>>>>
>>>>Chuck
>>>>
>>>> > -----Original Message-----
>>>> > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
>>>> > Sent: Tuesday, February 01, 2005 1:15 AM
>>>> > To: lucene-user@jakarta.apache.org
>>>> > Subject: which HTML parser is better?
>>>> > 
>>>> > Three HTML parsers (Lucene web application demo, CyberNeko HTML
>>>> > Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
>>>> > best? Can it filter tags that are auto-created by the MS Word
>>>> > 'Save As HTML files' function?
>>>> >
>>>> > ---------------------------------------------------------------------
>>>> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

