Date: Fri, 11 Mar 2011 07:56:40 -0500
Subject: Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
From: Erick Erickson
To: java-user@lucene.apache.org
References: <1299841401493-2664316.post@n3.nabble.com>

Solr doesn't do it. There are various tokenizers/filters that just strip
the HTML tags, but there's nothing built into Solr that I know of that
understands HTML; HTML-aware operations are outside Solr's purview.

Best
Erick

On Fri, Mar 11, 2011 at 6:50 AM, shrinath.m wrote:
> On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] <
> ml-node+2664380-1940163870-376162@n3.nabble.com> wrote:
>
>>   But I think the parser will most be used when crawling. So you can use
>> these parsers when crawling and save parsed result only.
>>
>
> Consider we've offline HTML pages, no parsing while crawling, now what ?
> Any tokenizer someone has built for this ?
>
>
> How does Solr do it ?
>
>
> --
> Regards
> Shrinath.M
>
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2664411.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
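
To make "just strip the HTML tags" concrete, here is a minimal sketch of that kind of pre-processing applied to an offline page before the text goes to an analyzer. It is not Solr's or Lucene's code: the class name NaiveHtmlStripper and the method stripTags are made up for illustration, and the regex approach is an assumption chosen for brevity. It knows nothing about HTML structure, so script bodies, comments and entities such as &amp; are not handled, which is exactly the limitation described in the reply above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
public class NaiveHtmlStripper {
    // Anything between '<' and the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces each tag with a space so adjacent words do not fuse,
    // then collapses runs of whitespace into a single space.
    public static String stripTags(String html) {
        String noTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        // prints "Title Some bold text."
        System.out.println(stripTags(html));
    }
}
```

The resulting plain string can be indexed like any other text field. Anything that needs to know where the text came from (title vs. body, link text, and so on) calls for a real HTML parser in front of the indexer rather than a tag stripper.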