lucene-solr-user mailing list archives

From Mike Klaas <mike.kl...@gmail.com>
Subject Re: How does HTMLStripWhitespaceTokenizerFactory work?
Date Mon, 11 Jun 2007 18:56:55 GMT
On 11-Jun-07, at 3:54 AM, Thierry Collogne wrote:

> Ok. Is it possible to get back the content without the html tags?
>

Well, it isn't stored anywhere in Solr.  It's best to think of
Lucene/Solr as two systems: the indexer applies a tokenization
transformation to the data and creates an inverted index; the storage
system keeps track of the data you give it _before_
analysis/tokenization.  If there is analysis you'd like to do that also
applies to the stored form of the doc, it's probably easier to apply it
before passing the data to Solr.
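
As a rough illustration of "apply it before passing the data to Solr",
here is a minimal sketch that strips the markup on the client side using
only Python's standard library.  The names strip_html and build_doc and
the field names "id" and "content" are hypothetical, and the sketch
stops short of actually posting the document to Solr.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, dropping tags and script/style contents."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space, so text split by inline tags
        # in the middle of a word would gain a space; acceptable for a sketch.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def strip_html(raw):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def build_doc(doc_id, raw_html):
    # Hypothetical field names; use whatever your schema defines.
    # The cleaned text is what Solr would both index and store.
    return {"id": doc_id, "content": strip_html(raw_html)}
```

With something like this, the stored field holds the cleaned text, so
what comes back in search results no longer contains the tags.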

-Mike

> On 08/06/07, Yonik Seeley <yonik@apache.org> wrote:
>>
>> On 6/8/07, Thierry Collogne <thierry.collogne@gmail.com> wrote:
>> > I am trying to use the solr.HTMLStripWhitespaceTokenizerFactory analyzer
>> > with no luck.
>> [...]
>> > Is this normal? Shouldn't the html code and the white spaces be removed
>> > from the field?
>>
>> For indexing purposes, yes.  The stored field you get back will be
>> unchanged though.
>> If you want to see what will be indexed, try the analysis debugger in
>> the admin pages.
>>
>> -Yonik
>>

