From: Sreejith S
To: java-user@lucene.apache.org
Date: Fri, 11 Mar 2011 19:39:46 -0800
Subject: Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
In-Reply-To: <23383.1299858142@parc.com>
References: <1299841401493-2664316.post@n3.nabble.com> <23383.1299858142@parc.com>

I suggest the Jsoup HTML parser, which is fast, simple, and easy to use. I
have tried many HTML parsers, and Jsoup is the one I am most comfortable with.

http://jsoup.org/

For tokenizing, IBM ICU provides the best tokenizers.

On 3/11/11, Bill Janssen wrote:
> shrinath.m wrote:
>
>> Consider we've offline HTML pages, no parsing while crawling, now what ?
>> Any tokenizer someone has built for this ?
>
> In UpLib, which uses PyLucene, I use BeautifulSoup to simplify Web pages
> by selecting only text between certain tags, before indexing them.
> These are offline Web pages, as in your application. Take a look at
> .
>
> Bill
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

--
*********************************
Sreejith.S

http://sreejiths.emurse.com/
http://srijiths.wordpress.com/
tweet2sree@twitter
*********************************
ILUGCBE
http://ilugcbe.techstud.org
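
The reply above suggests two steps: extract text from the HTML with a parser, then tokenize it. The tokenizing step can be sketched roughly as below. This is an illustration only and is not taken from the original message. It assumes the text has already been extracted, and it uses the JDK's java.text.BreakIterator as a stand-in for ICU, because ICU4J's com.ibm.icu.text.BreakIterator exposes the same getWordInstance/setText/first/next calls. The class name WordTokenSketch, the tokenize helper, and the sample sentence are all hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordTokenSketch {

    // Splits plain text into word tokens using locale-aware word boundaries.
    // Assumption: java.text.BreakIterator stands in for ICU4J's BreakIterator here.
    public static List<String> tokenize(String text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // Word boundaries also delimit whitespace and punctuation runs;
            // keep only segments that begin with a letter or digit.
            if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical input: in the pipeline discussed in this thread, this would be
        // the text an HTML parser extracted from a page, e.g. Jsoup.parse(html).text().
        String text = "Lucene indexes offline HTML pages.";
        System.out.println(tokenize(text)); // prints [Lucene, indexes, offline, HTML, pages]
    }
}
```

Each token would then typically go into an analyzed Lucene field. In practice a Lucene Analyzer does the tokenizing itself, so a hand-rolled loop like this is mainly useful for seeing where the word boundaries fall.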