lucene-java-user mailing list archives

From "Kauler, Leto S" <leto.kau...@education.tas.gov.au>
Subject RE: which HTML parser is better?
Date Wed, 02 Feb 2005 23:13:27 GMT
We index the content of HTML files, and because we only want the "good"
text and do not care about structure, well-formedness, etc., we went
with regular expressions similar to what Luke Shannon offered.

The only real difference is that we first remove entire
(script|style|csimport) blocks and similar, since their contents are not
useful for keyword searching, and afterwards remove every leftover
HTML tag.  I have been meaning to add an expression to extract things
like alt attribute text from <img> tags, though.
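
As an illustration only (this is not the poster's actual code), the two-pass
approach described above could be sketched in Java roughly as follows. The
class name HtmlTextExtractor and the method extractText are hypothetical, and
the comment-stripping and whitespace-collapsing steps are added assumptions
that the message does not mention.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names here are illustrative, not from the thread.
final class HtmlTextExtractor {

    // Pass 1: whole blocks whose contents are not useful for keyword search.
    // (?i) = case-insensitive, (?s) = '.' also matches line breaks.
    // \1 requires the closing tag to match the opening tag name.
    private static final Pattern BLOCKS = Pattern.compile(
            "(?is)<(script|style|csimport)\\b[^>]*>.*?</\\1\\s*>");

    // Assumption (not in the original message): drop HTML comments too,
    // so a '>' inside a comment does not confuse the tag pattern below.
    private static final Pattern COMMENTS = Pattern.compile("(?s)<!--.*?-->");

    // Pass 2: every leftover tag.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    // Assumption: collapse runs of whitespace left behind by removed markup.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private HtmlTextExtractor() {
    }

    static String extractText(String html) {
        String text = BLOCKS.matcher(html).replaceAll(" ");
        text = COMMENTS.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Calling HtmlTextExtractor.extractText(html) returns a single string suitable
for passing to an indexer. This sketch does not decode character entities
such as &amp; and it can be fooled by a literal '>' inside an attribute
value, so it suits the "good enough text" goal described here rather than
strict parsing.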

--Leto



> -----Original Message-----
> From: Karl Koch [mailto:TheRanger@gmx.net] 
> 
> I have been following this thread and have another question. 
> 
> Is there a piece of source code (preferably very short and
> simple (KISS)) that removes all HTML tags from HTML 
> content? HTML 3.2 would be enough...also no frames, CSS, etc. 
> 
> I do not need the HTML structure tree or any other 
> structure, but I need a facility to clean up HTML into its 
> normal underlying content before indexing that content as a whole.
> 
> Karl
> 
> > 
> >   > -----Original Message-----
> >   > From: Jingkang Zhang [mailto:zjingk@yahoo.com.cn]
> >   > Sent: Tuesday, February 01, 2005 1:15 AM
> >   > To: lucene-user@jakarta.apache.org
> >   > Subject: which HTML parser is better?
> >   > 
> >   > Three HTML parsers (Lucene web application
> >   > demo, CyberNeko HTML Parser, JTidy) are mentioned in
> >   > Lucene FAQ
> >   > 1.3.27. Which is the best? Can it filter tags that are
> >   > auto-created by MS-Word's 'Save As HTML files' function?
> >   > 

