lucene-solr-user mailing list archives

From Mike Klaas <>
Subject Re: How does HTMLStripWhitespaceTokenizerFactory work?
Date Mon, 11 Jun 2007 18:56:55 GMT
On 11-Jun-07, at 3:54 AM, Thierry Collogne wrote:

> Ok. Is it possible to get back the content without the html tags?

Well, the stripped text isn't stored anywhere in Solr.  It's best to
think of Lucene/Solr as two systems: the indexer applies a tokenization
transformation to the data and creates an inverted index; the storage
system keeps the data exactly as you gave it, _before_
analysis/tokenization.  If there is analysis you'd like to apply to the
stored value of the doc as well, it's probably easier to apply it
before passing the data to Solr.
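
As a rough illustration of "apply it before passing the data to Solr",
here is a minimal Python sketch that strips tags on the client side using
only the standard library.  The helper name strip_html is made up for
this example and is not part of Solr; the output would then be sent as
the field value, so that the stored copy is already tag-free.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join text nodes with a space (so adjacent block elements do not
        # run together), then collapse all whitespace runs to one space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw):
    """Hypothetical helper: return the text of `raw` with tags removed.

    Note: text inside <script> or <style> is kept, since this sketch
    does not special-case those elements.
    """
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    return parser.text()
```

Because every tag boundary becomes a space, inline markup inside a word
(for example wor<b>ld</b>) would be split in two; whether that is
acceptable depends on your data.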


> On 08/06/07, Yonik Seeley <> wrote:
>> On 6/8/07, Thierry Collogne <> wrote:
>> > I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
>> > with no luck.
>> [...]
>> > Is this normal? Shouldn't the html code and the white spaces be removed
>> > from the field?
>> For indexing purposes, yes.  The stored field you get back will be
>> unchanged though.
>> If you want to see what will be indexed, try the analysis debugger in
>> the admin pages.
>> -Yonik
