lucene-solr-commits mailing list archives

From Apache Wiki <>
Subject [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by ErickErickson
Date Fri, 08 Mar 2013 13:26:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by ErickErickson:

  A technology related to stemming is [[|lemmatization]],
which allows for "stemming" by expansion, taking a root word and 'expanding' it to all of
its various forms. Lemmatization can be used ''either'' at insertion time ''or'' at query
time. Lucene/Solr does not have built-in support for lemmatization, but it can be simulated
by using your own dictionaries and the [[#SynonymFilter|SynonymFilterFactory]].
  See LanguageAnalysis for details about stemming for various languages.
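  As a rough sketch of the dictionary-based approach, assume a hypothetical file named lemmas.txt in which each line lists the forms of one word (for example "run, runs, running, ran"). A field type along these lines would then expand each form to all listed forms at insertion time. The field type name and the file name are illustrative only and are not part of any stock Solr schema.
{{{
<fieldType name="text_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- lemmas.txt is a hypothetical, user-supplied dictionary of word forms -->
    <filter class="solr.SynonymFilterFactory" synonyms="lemmas.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
}}}
  Expanding only at insertion time keeps queries simple at the cost of a larger index; moving the synonym filter to the query analyzer instead trades that the other way.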
+ <!> [[Solr4.3]]
+ A recurring question is "how can I have the original term contribute more to the score than
the stemmed version?" In Solr 4.3, the KeywordRepeatFilterFactory has been added to support
this. The filter emits two tokens for each input token, one of which is marked
with the Keyword attribute. Stemmers that respect the keyword attribute pass the marked
token through unchanged, so the effect of this filter is to index both the original
word and the stemmed version. The 4 stemmers listed above all respect the keyword attribute.
+ For terms that are not changed by stemming, this results in duplicate, identical tokens
in the document. This can be avoided by adding the RemoveDuplicatesTokenFilterFactory after the stemmer.
+ {{{
+ <fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
+  <analyzer>
+    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
+    <filter class="solr.KeywordRepeatFilterFactory"/>
+    <filter class="solr.PorterStemFilterFactory"/>
+    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
+  </analyzer>
+ </fieldType>
+ }}}
  == Analyzers ==
  Analyzers are components that pre-process input text at index time and/or at search time.
 It's important to use the same or similar analyzers, so that text is processed in a compatible manner
at index and query time. For example, if an indexing analyzer lowercases words, then the
query analyzer should do the same to enable finding the indexed words.
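  The lowercasing example might look like the following minimal sketch. The field type name text_lower is hypothetical; the point is only that both the index and the query analyzer apply solr.LowerCaseFilterFactory, so a query for "Solr" matches the indexed term "solr".
{{{
<fieldType name="text_lower" class="solr.TextField" positionIncrementGap="100">
  <!-- applied to document text when it is indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- applied to query text; lowercases in the same way as the index analyzer -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
}}}
  When the two chains are identical, as here, a single <analyzer> element without a type attribute serves both index and query time.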
