lucene-dev mailing list archives

From "Erik Hatcher" <li...@ehatchersolutions.com>
Subject Re: [Help with upcomming submission/RFC] Excel Parser
Date Fri, 18 Jan 2002 18:37:52 GMT
----- Original Message -----
From: "Doug Cutting" <DCutting@grandcentral.com>

> The HTMLParser is just a "demo" because it's a hack.  I've always hoped
> that someone would do a better version that we could proudly add to a
> "real" package.

I posted an HtmlDocument class that uses JTidy to DOM'ify HTML documents and
creates a Lucene Document object with Fields for title and body (stripping
HTML tags).
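
For readers without the posted class at hand, the idea can be sketched roughly
as follows. This is not the HtmlDocument class itself. To stay self-contained
it uses neither JTidy nor Lucene: the JDK's XML parser stands in for JTidy (so
the input must already be well-formed XHTML with lowercase tags), and a plain
Map stands in for the Lucene Document. The names HtmlDocumentSketch, extract,
textOf and collectText are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: DOM-based extraction of "title" and "body" text.
// Assumption: the JDK XML parser replaces JTidy, so only well-formed XHTML works.
// Assumption: a Map of field name -> text replaces the Lucene Document.
class HtmlDocumentSketch {

    // Parses the markup into a DOM and returns the two fields as plain text.
    static Map<String, String> extract(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(new InputSource(new StringReader(xhtml)));
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("title", textOf(dom, "title"));
            fields.put("body", textOf(dom, "body"));
            return fields;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Text content of the first element with the given tag name, tags stripped
    // and runs of whitespace collapsed; empty string if the element is absent.
    static String textOf(Document dom, String tag) {
        NodeList nodes = dom.getElementsByTagName(tag);
        if (nodes.getLength() == 0) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        collectText(nodes.item(0), out);
        return out.toString().trim().replaceAll("\\s+", " ");
    }

    // Depth-first walk that appends every text node, separated by a space so
    // that adjacent block elements do not run together.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE
                || node.getNodeType() == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, out);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>Test page</title></head>"
            + "<body><h1>Heading</h1><p>Some <b>bold</b> text</p></body></html>";
        extract(page).forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

In a real version, each map entry would presumably become a Field on a Lucene
Document, and JTidy would supply the DOM so that messy real-world HTML could be
handled as well.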

I'm not sure if it qualifies as "better", but it's at least food for thought,
and perhaps using JTidy is the best way to pull data out.

    Erik




