lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: HTMLDocument
Date Sun, 01 Feb 2004 12:21:40 GMT
On Feb 1, 2004, at 6:19 AM, lucene@nitwit.de wrote:
> Hi!
>
> Is there any HTMLDocument out there? The one in the demo package of
> lucene does not handle non-wellformed HTML files (what about nekohtml?)
> and seems to have some other inabilities and bugs as well (and why isn't
> it part of the distro but in a demo package?!)?

Nutch uses NekoHTML, so you can browse around that codebase and borrow 
its implementation.  The sandbox also has a contribution/ant directory 
containing an HTMLDocument that parses HTML with JTidy, which does a 
pretty good job of handling bad HTML.

Why isn't it in the distribution?  Parsing HTML and turning it into a 
Lucene document is not always done the same way, and doing so really 
sits on top of the core rather than being integral to it.
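
To give a feel for how small that layer can be, here is a minimal 
sketch.  It is not the demo, sandbox, or Nutch code.  As assumptions 
for illustration, it uses the lenient HTML parser that ships with the 
JDK (javax.swing.text.html) instead of JTidy or NekoHTML, and it 
returns a plain Map of field names to text rather than a Lucene 
Document.  The class name HtmlFieldExtractor and the field names 
"title" and "contents" are hypothetical.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: pulls indexable text out of (possibly malformed) HTML.
final class HtmlFieldExtractor {

    // Returns a "title" field and a "contents" field as plain strings.
    static Map<String, String> extract(Reader html) throws IOException {
        final StringBuilder title = new StringBuilder();
        final StringBuilder contents = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = true;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Route text to the title or the body buffer, one space between chunks.
                StringBuilder target = inTitle ? title : contents;
                target.append(' ').append(data);
            }
        };

        // The JDK parser tolerates unclosed tags; 'true' ignores charset directives.
        new ParserDelegator().parse(html, callback, true);

        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", normalize(title));
        fields.put("contents", normalize(contents));
        return fields;
    }

    // Collapses runs of whitespace so the joined chunks read as one string.
    private static String normalize(CharSequence text) {
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String badHtml = "<html><head><title>Hello</title></head>"
                + "<body><p>Unclosed <b>bold text</p>";
        System.out.println(extract(new StringReader(badHtml)));
    }
}
```

Each map entry would then become a field on a Lucene Document.  Which 
fields to create, and whether each is stored, tokenized, or both, is 
exactly the part that differs from one application to the next.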

	Erik
