From: "John Wang"
To: java-user@lucene.apache.org
Date: Thu, 22 Jun 2006 08:22:37 -0700
Subject: Re: HTML text extraction

Hi Xuefeng:

Can you please send me your htmlparser too?

Thanks,
-John

On 6/21/06, Daniel Noll wrote:
>
> Simon Courtenage wrote:
> > I also use htmlparser, which is rather good. I've had to customize it,
> > though, to parse strings containing html source rather than accept
> > urls of resources to fetch etc. Also it crashes on meta tags that
> > don't have name attributes (something I discovered only a couple of
> > days ago).
>
> Actually, it already accepts strings without modifying the library:
>
>     String htmlSource = "...";
>     Parser parser = new Parser(new Lexer(htmlSource));
>
> I will have to watch out for those meta tags though. Time to go test it.
>
> Daniel
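
For anyone following the two points quoted above (parsing HTML that is already in a string, and meta tags that have no name attribute), here is a minimal sketch. It deliberately does not use htmlparser: it swaps in the HTML parser that ships with the JDK (javax.swing.text.html) so it runs without extra jars. The names HtmlStringSketch, Collector and parse, and the sample markup, are made up for illustration. The null check only shows the kind of guard that avoids the problem in calling code; it is not a fix for the htmlparser crash Simon describes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of htmlparser or Lucene.
final class HtmlStringSketch {

    /** Gathers the text chunks and the named meta tags the parser reports. */
    static final class Collector extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();
        final Map<String, String> meta = new LinkedHashMap<>();

        @Override
        public void handleText(char[] data, int pos) {
            // Separate chunks with a space so adjacent elements do not run together.
            text.append(data).append(' ');
        }

        @Override
        public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag != HTML.Tag.META) {
                return;
            }
            Object name = attrs.getAttribute(HTML.Attribute.NAME);
            Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
            // A meta tag may carry no name (or no content) attribute: skip it
            // instead of dereferencing null.
            if (name == null || content == null) {
                return;
            }
            meta.put(name.toString(), content.toString());
        }
    }

    /** Parses HTML held in a String; nothing is fetched from a URL. */
    static Collector parse(String html) {
        Collector collector = new Collector();
        try {
            // true = ignore charset declarations, since the input is already decoded.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta content=\"orphan\">"
                + "<meta name=\"description\" content=\"A page\">"
                + "</head><body><p>Hello <b>world</b></p></body></html>";
        Collector c = parse(html);
        System.out.println(c.text.toString().trim());
        System.out.println(c.meta);
    }
}
```

The collected text could then be handed to whatever indexing code you already have; the nameless meta tag is simply ignored.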