lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by naomidushay
Date Mon, 01 Nov 2010 20:19:58 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by naomidushay.
The comment on this change is: clarifying word delimiter filter factory explanation.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=92&rev2=93

--------------------------------------------------

  ==== solr.WordDelimiterFilterFactory ====
  Creates `solr.analysis.WordDelimiterFilter`.
  
- Splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:
+ Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:
  
   * split on intra-word delimiters (by default, all non alpha-numeric characters).
    * `"Wi-Fi" -> "Wi", "Fi"`
-  * split on case transitions
+  * split on case transitions (can be turned off - see splitOnCaseChange parameter)
    * `"PowerShot" -> "Power", "Shot"`
-  * split on letter-number transitions
+  * split on letter-number transitions (can be turned off - see splitOnNumerics parameter)
    * `"SD500" -> "SD", "500"`
   * leading and trailing intra-word delimiters on each subword are ignored
    * `"//hello---there, 'dude'" -> "hello", "there", "dude"`
-  * trailing "'s" are removed for each subword
+  * trailing "'s" are removed for each subword (can be turned off - see stemEnglishPossessive parameter)
    * `"O'Neil's" -> "O", "Neil"`
     * Note: this step isn't performed in a separate filter because of possible subword combinations.
  
- Splitting is affected by the following parameter:
+ Splitting is affected by the following parameters:
  
   * '''splitOnCaseChange="1"''' causes lowercase => uppercase transitions to generate a new part [Solr 1.3]:
    * `"PowerShot" => "Power" "Shot"`
    * `"TransAM" => "Trans" "AM"`
+   * default is true ("1"); set to 0 to turn off
   * '''splitOnNumerics="1"''' causes alphabet => number transitions to generate a new part [Solr 1.3]:
    * `"j2se" => "j" "2" "se"`
+   * default is true ("1"); set to 0 to turn off
   * '''stemEnglishPossessive="1"''' causes trailing "'s" to be removed for each subword.
    * `"Doug's" => "Doug"`
+   * default is true ("1"); set to 0 to turn off
  
  Note that this is the default behaviour in all released versions of Solr.
  
@@ -360, +363 @@

   * '''generateWordParts="1"''' causes parts of words to be generated:
    * `"PowerShot" => "Power" "Shot"` (if `splitOnCaseChange=1`)
    * `"Power-Shot" => "Power" "Shot"`
+   * default is 0
   * '''generateNumberParts="1"''' causes number subwords to be generated:
    * `"500-42" => "500" "42"`
+   * default is 0
   * '''catenateWords="1"''' causes maximum runs of word parts to be catenated:
    * `"wi-fi" => "wifi"`
+   * default is 0
   * '''catenateNumbers="1"''' causes maximum runs of number parts to be catenated:
    * `"500-42" => "50042"`
+   * default is 0
   * '''catenateAll="1"''' causes all subword parts to be catenated:
    * `"wi-fi-4000" => "wifi4000"`
+   * default is 0
   * '''preserveOriginal="1"''' causes the original token to be indexed without modifications (in addition to the tokens produced due to other options)
+   * default is 0
  
  These parameters may be combined in any way.
  

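The splitting and catenation rules in the diff above can be approximated with a short Python sketch. This is not the Solr implementation: the function names `split_subwords` and `word_delimiter`, the keyword names, and the keyword defaults for the generate/catenate/preserve flags are all hypothetical and chosen for illustration only. The sketch handles ASCII input, ignores token positions and offsets, and may emit tokens in a different order than the real filter.

```python
import re
from itertools import groupby


def split_subwords(token, split_on_case_change=True, split_on_numerics=True,
                   stem_english_possessive=True):
    """Split one token into subwords (ASCII-only approximation of the rules)."""
    subwords = []
    # Split on intra-word delimiters; keep apostrophes for the possessive check.
    for chunk in re.split(r"[^A-Za-z0-9']+", token):
        if stem_english_possessive:
            chunk = re.sub(r"'[sS]$", "", chunk)  # drop a trailing "'s"
        for part in chunk.split("'"):  # remaining apostrophes act as delimiters
            if split_on_numerics:
                # Mark letter<->digit boundaries with a space.
                part = re.sub(
                    r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", part)
            if split_on_case_change:
                # Mark lowercase -> uppercase boundaries with a space.
                part = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", part)
            subwords.extend(part.split())  # split() also discards empty pieces
    return subwords


def word_delimiter(token, generate_word_parts=False, generate_number_parts=False,
                   catenate_words=False, catenate_numbers=False,
                   catenate_all=False, preserve_original=False, **split_options):
    """Combine the generate/catenate/preserve options (hypothetical model)."""
    parts = split_subwords(token, **split_options)
    out = [token] if preserve_original else []
    # Emit individual word and number parts when requested.
    for part in parts:
        wanted = generate_number_parts if part.isdigit() else generate_word_parts
        if wanted:
            out.append(part)
    # Catenate maximal runs of adjacent parts of the same kind.
    for is_number, run in groupby(parts, key=str.isdigit):
        run = list(run)
        wanted = catenate_numbers if is_number else catenate_words
        if wanted and len(run) > 1:
            out.append("".join(run))
    # Catenate every part regardless of kind.
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))
    return out


print(split_subwords("O'Neil's"))                       # ['O', 'Neil']
print(word_delimiter("wi-fi-4000", catenate_all=True))  # ['wifi4000']
```

Passing every flag explicitly, as in the calls above, keeps the sketch independent of whatever defaults a given Solr release applies.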