lucene-java-user mailing list archives

From sergiu gordea <>
Subject Re: which HTML parser is better?
Date Thu, 03 Feb 2005 07:08:46 GMT
Kauler, Leto S wrote:

Another very cheap but robust solution, if you use Linux, is to have
lynx parse your pages:

lynx -dump page.html > page.txt

This will strip out all HTML and the script, style, and csimport tags,
and you will have a .txt file ready for indexing.
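
If the indexer is written in Java, the same step could be driven from code. The following is only a rough sketch under a few assumptions: the lynx binary is on the PATH, its -dump and -nolist options are used, and the class and method names (LynxDump, command, dumpToFile) are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical wrapper around the lynx command line; not part of Lucene. */
final class LynxDump {

    private LynxDump() {}

    // Builds the argument list. -dump prints the rendered page to stdout and
    // exits; -nolist drops the numbered list of links lynx otherwise appends.
    static List<String> command(Path htmlFile) {
        return List.of("lynx", "-dump", "-nolist", htmlFile.toString());
    }

    // Runs lynx, writes its stdout to textFile, and returns the exit code.
    // Assumes lynx is installed; an IOException is thrown if it cannot start.
    static int dumpToFile(Path htmlFile, Path textFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(htmlFile))
                .redirectOutput(textFile.toFile())
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        return process.waitFor();
    }
}
```

A call such as LynxDump.dumpToFile(Paths.get("page.html"), Paths.get("page.txt")) would then leave the plain text in page.txt, and a non-zero return value signals that lynx reported a problem.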



>We index the content from HTML files, and because we only want the "good"
>text and do not care about the structure, well-formedness, etc., we went
>with regular expressions similar to what Luke Shannon offered.
>The only real difference is that we first remove entire blocks of
>(script|style|csimport) and similar, since the contents of those are not
>useful for keyword searching, and afterward just remove all leftover
>HTML tags.  I have been meaning to add an expression to extract things
>like alt attribute text from <img> though.
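
The two-pass idea described above (drop whole script/style/csimport blocks first, then strip every remaining tag) might look roughly like this in Java. It is a minimal sketch, not the code the poster actually uses: the class name HtmlTextExtractor, the method extractText, and the exact patterns are assumptions, and the comment-removal and whitespace-collapsing steps are extras added here.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Lucene: regex-based HTML-to-text cleanup. */
final class HtmlTextExtractor {

    // Whole blocks whose contents are useless for keyword search.
    // (?is): case-insensitive, and '.' also matches line breaks.
    private static final Pattern BLOCKS = Pattern.compile(
            "(?is)<(script|style|csimport)\\b[^>]*>.*?</\\1\\s*>");

    // HTML comments, which may themselves contain '>' characters.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");

    // Any remaining tag, opening or closing.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    // Runs of whitespace left behind by the removals.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private HtmlTextExtractor() {}

    static String extractText(String html) {
        // Pass 1: remove script/style/csimport blocks including their contents.
        String text = BLOCKS.matcher(html).replaceAll(" ");
        text = COMMENTS.matcher(text).replaceAll(" ");
        // Pass 2: remove every leftover tag but keep the text between tags.
        text = TAGS.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Tags are replaced with a space rather than the empty string so that words separated only by markup do not run together. Character entities such as &amp; are left undecoded, and a stray "<" in the text can confuse the tag pattern, so this suits only the "good enough for indexing" case discussed here.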
>>-----Original Message-----
>>From: Karl Koch [] 
>>I have been following this thread and have another question.
>>Is there a piece of source code (preferably very
>>short and simple
>>(KISS)) which removes all HTML tags from HTML
>>content? HTML 3.2 would be enough... also no frames, CSS, etc.
>>I do not need the HTML structure tree or any other
>>structure, but I need a facility to clean up HTML into its
>>normal underlying content before indexing that content as a whole.
>>>  > -----Original Message-----
>>>  > From: Jingkang Zhang []
>>>  > Sent: Tuesday, February 01, 2005 1:15 AM
>>>  > To:
>>>  > Subject: which HTML parser is better?
>>>  > 
>>>  > Three HTML parsers (Lucene web application
>>>  > demo, CyberNeko HTML Parser, JTidy) are mentioned in
>>>  > Lucene FAQ
>>>  > 1.3.27. Which is the best? Can it filter tags that are
>>>  > auto-created by the MS Word 'Save As HTML' function?
>>>  > 