lucene-java-user mailing list archives

From Erik Hatcher <>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 13:22:30 GMT

On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:

> Hello,
> I have been following this thread and have another question.
> Is there a piece of source code (preferably very short and simple
> (KISS)) that removes all HTML tags from HTML content? HTML 3.2
> would be enough, with no frames, CSS, etc.
> I do not need the HTML structure tree or any other structure, but I
> need a facility to clean up HTML into its underlying text content
> before indexing that content as a whole.

The code in the Lucene Sandbox for parsing HTML with JTidy (under 
contributions/ant) for the <index> task does what you ask.
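
For a sense of how small such a tag stripper can be, here is a minimal sketch. It is not the Sandbox code and does not use JTidy; it swaps in the HTML parser that ships with the JDK (javax.swing.text.html), which targets roughly HTML 3.2. The class name HtmlTextExtractor and the method extractText are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of Lucene or the Sandbox.
final class HtmlTextExtractor {

    /** Returns the character data of the given HTML with all tags dropped. */
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser reports each run of character data through handleText;
        // tag callbacks are left at their default no-op implementations.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text runs so words do not fuse together.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document,
            // since the input has already been decoded to characters.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The returned string can then be added to a Lucene Document as a single text field. Whitespace handling and the treatment of malformed markup differ between this parser and JTidy, so treat this as a starting point rather than a drop-in equivalent.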

