lucene-java-user mailing list archives

From "John Wang" <john.w...@gmail.com>
Subject Re: HTML text extraction
Date Thu, 22 Jun 2006 15:22:37 GMT
Hi Xuefeng:

     Can you please send me your htmlparser too?

thanks

-John

On 6/21/06, Daniel Noll <daniel@nuix.com.au> wrote:
>
> Simon Courtenage wrote:
> > I also use htmlparser, which is rather good.  I've had to customize it,
> > though, to parse strings containing HTML source rather than accept
> > URLs of resources to fetch.  It also crashes on meta tags that don't
> > have name attributes (something I discovered only a couple of days ago).
>
> Actually, it already accepts strings without modifying the library:
>
>     String htmlSource = "<html>...</html>";
>     Parser parser = new Parser(new Lexer(htmlSource));
>
> I will have to watch out for those meta tags though.  Time to go test it.
>
> Daniel
>
>
> --
> Daniel Noll
>
> Nuix Pty Ltd
> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
> Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902
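
For anyone who wants to try extracting text from an HTML string without extra jars, here is a minimal sketch. Note that it does NOT use htmlparser: it swaps in the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), so it says nothing about how htmlparser itself behaves on meta tags. The class name HtmlTextExtractor, the method extractText, and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls text out of an in-memory HTML string
// using the JDK's built-in parser rather than htmlparser.
public class HtmlTextExtractor {

    /** Returns the text content of the given HTML source, whitespace-normalized. */
    public static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The callback receives each run of character data found between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate runs with a space so adjacent elements do not fuse.
                text.append(data).append(' ');
            }
        };

        // Parse from a Reader over the string; no URL fetch is involved.
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);

        // Collapse whitespace runs so the result is convenient to index.
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        // Sample input includes a meta tag with no name attribute.
        String htmlSource = "<html><head><meta content=\"no name attribute\"></head>"
                + "<body><p>Hello <b>Lucene</b> users</p></body></html>";
        System.out.println(extractText(htmlSource));
    }
}
```

The returned string can then be put into a Lucene field for indexing. Because every text run is followed by a space, words split across inline tags (for example "foo<b>bar</b>") come out as two tokens, which may or may not be what you want.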
