lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: which HTML parser is better?
Date Wed, 02 Feb 2005 13:22:30 GMT

On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:

> Hello,
>
> I have been following this thread and have another question.
>
> Is there a piece of source code (preferably very short and simple
> (KISS)) that removes all HTML tags from HTML content? HTML 3.2
> would be enough, with no frames, CSS, etc.
>
> I do not need the HTML structure tree or any other structure, but I
> need a facility to reduce HTML to its underlying text content before
> indexing that content as a whole.
>

The code in the Lucene Sandbox for parsing HTML with JTidy (under 
contributions/ant) for the <index> task does what you ask.
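If pulling in JTidy is more than you want, here is a minimal sketch of the
same idea using only the HTML parser that ships with the JDK
(javax.swing.text.html). To be clear, this is not the sandbox code; it
swaps in a different parser. The class name HtmlTextExtractor and the
method name extract are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of Lucene or JTidy): collects the text
// chunks reported by the JDK's Swing HTML parser and ignores all tags.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Separate chunks with a space so adjacent elements do not run together.
        text.append(' ').append(data);
    }

    static String extract(String html) {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared in the document itself
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Collapse runs of whitespace into single spaces.
        return callback.text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extract(html));
    }
}
```

The returned string can then go into a single Lucene field. Because a
space is inserted between chunks, a word split by inline markup (for
example <b>wor</b>ld) would come out as two tokens, so treat this as a
starting point rather than a finished cleaner.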

	Erik



