lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From sergiu gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 07:08:46 GMT
Kauler, Leto S wrote:

Another very cheap, but robust solution in the case you use linux is to 
make lynx to parse your pages.

lynx page.html > page.txt.

This will strip out all html and  script, style, csimport tags. And you 
will have a .txt file ready for indexing.

  Best,

  Sergiu

>We index the content from HTML files and because we only want the "good"
>text and do not care about the structure, well-formedness, etc we went
>with regular expressions similar to what Luke Shannon offered.
>
>Only real difference being that we firstly remove entire blocks of
>(script|style|csimport) and similar since the contents of those are not
>useful for keyword searching, and afterward just remove every leftover
>HTML tags.  I have been meaning to add an expression to extract things
>like alt attribute text from <img> though.
>
>--Leto
>
>
>
>  
>
>>-----Original Message-----
>>From: Karl Koch [mailto:TheRanger@gmx.net] 
>>
>>I have  been following this thread and have another question. 
>>
>>Is there a piece of sourcecode (which is preferably very 
>>short and simple
>>(KISS)) which allows to remove all HTML tags from HTML 
>>content? HTML 3.2 would be enough...also no frames, CSS, etc. 
>>
>>I do not need to have the HTML strucutre tree or any other 
>>structure but need a facility to clean up HTML into its 
>>normal underlying content before indexing that content as a whole.
>>
>>Karl
>>
>>    
>>
>>>  > -----Original Message-----
>>>  > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
>>>  > Sent: Tuesday, February 01, 2005 1:15 AM
>>>  > To: lucene-user@jakarta.apache.org
>>>  > Subject: which HTML parser is better?
>>>  > 
>>>  > Three HTML parsers(Lucene web application
>>>  > demo,CyberNeko HTML Parser,JTidy) are mentioned in
>>>  > Lucene FAQ
>>>  > 1.3.27.Which is the best?Can it filter tags that are
>>>  > auto-created by MS-word 'Save As HTML files' function?
>>>  > 
>>>      
>>>
>
>CONFIDENTIALITY NOTICE AND DISCLAIMER
>
>Information in this transmission is intended only for the person(s) to whom it is addressed
and may contain privileged and/or confidential information. If you are not the intended recipient,
any disclosure, copying or dissemination of the information is unauthorised and you should
delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised
use of the information contained in this transmission.
>
>This disclaimer has been automatically added.
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>  
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message