From: "Thierry Collogne"
To: solr-user@lucene.apache.org
Date: Mon, 11 Jun 2007 12:54:19 +0200
Subject: Re: How does HTMLStripWhitespaceTokenizerFactory work?

Ok. Is it possible to get back the content without the HTML tags?

On 08/06/07, Yonik Seeley wrote:
>
> On 6/8/07, Thierry Collogne wrote:
> > I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
> > with no luck.
> [...]
> > Is this normal? Shouldn't the html code and the white spaces be removed
> > from the field?
>
> For indexing purposes, yes. The stored field you get back will be
> unchanged though.
> If you want to see what will be indexed, try the analysis debugger in
> the admin pages.
>
> -Yonik
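
Since the stored field comes back unchanged, one possible workaround (not suggested in this thread, only a sketch) is to strip the markup on the client before the document is sent to Solr, so that the stored value is already plain text. The helper names below (`_TextExtractor`, `strip_html`) are hypothetical and use only the Python standard library; nothing here is part of Solr.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops the tags around them."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        # Note: contents of <script> and <style> are not filtered in this sketch.
        self.chunks.append(data)


def strip_html(markup: str) -> str:
    """Return the text content of `markup` with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Join chunks with a space so adjacent elements do not run together,
    # then collapse all runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_html("<p>Hello   <b>Solr</b> &amp; friends</p>\n"))
```

The cleaned string could then be indexed as the stored field's value, possibly alongside a second field that keeps the original HTML if it is still needed. Joining chunks with a space is a design choice: it avoids gluing neighbouring paragraphs together, at the cost of splitting a word that has inline markup in the middle of it.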