lucene-java-user mailing list archives

From "John Wang" <john.w...@gmail.com>
Subject Re: HTML text extraction
Date Wed, 21 Jun 2006 15:19:03 GMT
Thanks everyone for your responses!
I will try them out.

-John

On 6/20/06, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>
> John,
>
> I also wrote about using NekoHTML, I think.  I prefer that to JTidy.  That
> also tells you what Simpy.com uses.
>
> Otis
>
> ----- Original Message ----
> From: John Wang <john.wang@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Wednesday, June 21, 2006 1:39:41 AM
> Subject: HTML text extraction
>
> Can someone please suggest an HTML text extraction library? The Lucene
> book recommends Tidy, but JTidy does not seem to be actively maintained.
>
> Otis, what do you guys use at Simpy?
>
> Thanks
>
> -john
