lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "AnalyzersTokenizersTokenFilters" by israelekpo
Date Sat, 21 Aug 2010 23:30:41 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "AnalyzersTokenizersTokenFilters" page has been changed by israelekpo.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?action=diff&rev1=84&rev2=85

--------------------------------------------------

  = Analyzers, Tokenizers, and Token Filters =
- 
  == Overview ==
- 
+ When a document is indexed, its individual fields are subject to the analyzing and tokenizing
filters that can transform and normalize the data in the fields: for example, removing blank
spaces, removing HTML code, stemming, or removing a particular character and replacing it
with another. You may need to perform some of these or similar operations at indexing time
as well as at query time. For example, you might perform a [[http://en.wikipedia.org/wiki/Soundex|Soundex]]
transformation (a type of phonetic hashing) on a string to enable a search based upon the word
and upon its 'sound-alikes'.
  
  The lists below provide an overview of '''''some''''' of the more heavily used Tokenizers
and !TokenFilters provided by Solr "out of the box" along with tips/examples of using them.
 '''This list should by no means be considered the "complete" list of all Analysis classes
available in Solr!'''  In addition to new classes being added on an ongoing basis, you can
load your own custom Analysis code as a [[SolrPlugins|Plugin]].
  
@@ -12, +10 @@

  
  For information about some language-specific !Tokenizers and !TokenFilters available in
Solr, please consult LanguageAnalysis.
  
+ '''Note:''' For a good background on Lucene Analysis, it's recommended that you read the
following sections in [[http://lucenebook.com/|Lucene In Action]]:
+ 
   * 1.5.3 : Analyzer
   * Chapter 4.0 through 4.7 at least
  
  Try searches for "analyzer", "token", and "stemming".
  
- 
  <<TableOfContents>>
  
  == Stemming ==
- 
  There are four types of stemming strategies:
+ 
+  * [[http://tartarus.org/~martin/PorterStemmer/|Porter]] or Reduction stemming — A transforming
algorithm that reduces any of the forms of a word such as "runs, running, ran", to its elemental
root e.g., "run". Porter stemming must be performed ''both'' at insertion time and at query
time.
+  * [[http://code.google.com/p/lucene-hunspell/|Lucene-Hunspell]] aims to provide features
such as stemming, decompounding, spellchecking, normalization, term expansion, etc., taking
advantage of existing lexical resources already created and widely used in projects like
OpenOffice. It is still in alpha, but supports an impressive list of languages (see
[[http://lucene-eurocon.org/sessions-track2-day2.html#5|this presentation]] for more).
+  * Expansion stemming — Takes a root word and 'expands' it to all of its various forms
— can be used ''either'' at insertion time ''or'' at query time.  One way to approach this
is by using the [[#SynonymFilter|SynonymFilterFactory]]
+  * [[AnalyzersTokenizersTokenFilters/Kstem|KStem]], an alternative to Porter for developers
looking for a less aggressive stemmer.
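 
 As a minimal sketch, reduction (Porter) stemming is typically enabled by adding a stem filter
to the analyzer chain used for both indexing and querying (the field type name below is just
an example):
 
 {{{
 <fieldtype name="stemmed" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
 </fieldtype>
 }}}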
  
  == Analyzers ==
- 
  Analyzers are components that pre-process input text at index time and/or at  search time.
 It's important to use the same or similar analyzers that process text in a compatible manner
at index and query time.  For example, if an indexing analyzer lowercases words, then the
query analyzer should do the same to enable finding the indexed words.
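 
 As a sketch, schema.xml allows separate (but compatible) chains to be declared for the two
phases via the `type` attribute on `<analyzer>` (the field type name below is just an example):
 
 {{{
 <fieldtype name="text_lower" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
 </fieldtype>
 }}}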
  
  On wildcard and fuzzy searches, no text analysis is performed on the search word.
@@ -39, +35 @@

  The Analyzer class is an abstract class, but Lucene comes with a few concrete Analyzers
that pre-process their input in different ways. If you need to pre-process input text and
queries in a way that is not provided by any of Lucene's built-in Analyzers, you will need
to specify a custom Analyzer in the Solr schema.
  
  == Char Filters ==
- 
  <!> [[Solr1.4]]
  
 A Char Filter is a component that pre-processes input characters. Char Filters can be chained,
like Token Filters, and placed in front of a Tokenizer. They can add, change, or remove characters
without corrupting token offsets.
  
  == Tokens and Token Filters ==
- 
 An analyzer splits up a text field into tokens that the field is indexed by. An Analyzer
is normally implemented by creating a '''Tokenizer''' that splits up a stream (normally a
single field value) into a series of tokens. These tokens are then passed through a series
of Token Filters that add, change, or remove tokens. The field is then indexed by the resulting
token stream.
  
  The Solr web admin interface may be used to show the results of text analysis, and even
the results after each analysis phase if a custom analyzer is used.
  
  == Specifying an Analyzer in the schema ==
- 
  A Solr schema.xml file allows two methods for specifying the way a text field is analyzed.
(Normally only field types of `solr.TextField` will have Analyzers explicitly specified in
the schema):
  
+  1. Specifying the '''class name''' of an Analyzer — anything extending org.apache.lucene.analysis.Analyzer.
<<BR>> Example: <<BR>>
+  {{{
  <fieldtype name="nametext" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
  </fieldtype>
  }}}
+  1. Specifying a '''!TokenizerFactory''' followed by a list of optional !TokenFilterFactories
that are applied in the listed order. Factories that can create the tokenizers or token filters
are used to prepare configuration for the tokenizer or filter and avoid the overhead of creation
via reflection. <<BR>> Example: <<BR>>
+  {{{
  <fieldtype name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
@@ -74, +69 @@

  Any Analyzer, !TokenizerFactory, or !TokenFilterFactory may be specified using its full
class name with package -- just make sure they are in Solr's classpath when you start your
appserver.  Classes in the `org.apache.solr.analysis.*` package can be referenced using the
short alias `solr.*`.
  
  If you want to use custom Tokenizers or !TokenFilters, you'll need to write a very simple
factory that subclasses !BaseTokenizerFactory or !BaseTokenFilterFactory, something like this...
+ 
  {{{
  public class MyCustomFilterFactory extends BaseTokenFilterFactory {
    public TokenStream create(TokenStream input) {
@@ -81, +77 @@

    }
  }
  }}}
- 
  === CharFilterFactories ===
- 
  <!> [[Solr1.4]]
+ 
  ==== Example ====
  {{{
  <fieldType name="charfilthtmlmap" class="solr.TextField">
@@ -95, +90 @@

        </analyzer>
      </fieldType>
  }}}
- 
  ==== solr.MappingCharFilterFactory ====
- 
  Creates `org.apache.lucene.analysis.MappingCharFilter`.
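 
 A minimal sketch of how this char filter might be declared (the mapping file name is just
an example; the file contains rules such as `"á" => "a"`):
 
 {{{
 <fieldType name="mapped" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
 </fieldType>
 }}}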
  
  ==== solr.PatternReplaceCharFilterFactory ====
- 
 Creates `org.apache.solr.analysis.PatternReplaceCharFilter`. Applies a regex pattern to the
string in the character stream, replacing matching occurrences with the specified replacement
string.
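 
 As a sketch, the filter takes `pattern` and `replacement` attributes; the pattern below (collapsing
runs of dashes into a single dash) is purely illustrative:
 
 {{{
 <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-+" replacement="-"/>
 }}}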
  
  ==== solr.HTMLStripCharFilterFactory ====
- 
 Creates `org.apache.solr.analysis.HTMLStripCharFilter`. `HTMLStripCharFilter` strips HTML
from the input stream and passes the result on to the next `CharFilter` or to the `Tokenizer`.
  
  HTML stripping features:
+ 
   * The input need not be an HTML document as only constructs that look like HTML will be
removed.
   * Removes HTML/XML tags while keeping the content
+   * Attributes within tags are also removed, and attribute quoting is optional.
   * Removes XML processing instructions: <?foo bar?>
   * Removes XML comments
   * Removes XML elements starting with <! and ending with >
   * Removes contents of <script> and <style> elements.
+   * Handles XML comments inside these elements (normal comment processing won't always work)
+   * Replaces numeric character entity references like {{{&#65;}}} or {{{&#x7f;}}}
+    * The terminating ';' is optional if the entity reference is followed by whitespace.
+   * Replaces all [[http://www.w3.org/TR/REC-html40/sgml/entities.html|named character entity
references]].
+    * &nbsp; is replaced with a space instead of 0xa0
+    * terminating ';' is mandatory to avoid false matches on something like "Alpha&Omega
Corp"
  
  HTML stripping examples:
- 
+ ||my <a href="www.foo.bar">link</a> ||my link ||
+ ||<?xml?><br>hello<!--comment--> ||hello ||
+ ||hello<script><-- f('<--internal--></script>'); --></script>
||hello ||
+ ||if a<b then print a; ||if a<b then print a; ||
+ ||hello <td height=22 nowrap align="left"> ||hello ||
+ ||a&lt;b &#65 Alpha&Omega &Omega; ||a<b A Alpha&Omega Ω ||
  
  === TokenizerFactories ===
- 
  Solr provides the following  !TokenizerFactories (Tokenizers and !TokenFilters):
  
  ==== solr.LetterTokenizerFactory ====
- 
  Creates `org.apache.lucene.analysis.LetterTokenizer`.
  
  Creates tokens consisting of strings of contiguous letters. Any non-letter characters will
be discarded.
  
+  . Example: `"I can't" ==> "I", "can", "t"`
  
  <<Anchor(WhitespaceTokenizer)>>
+ 
  ==== solr.WhitespaceTokenizerFactory ====
- 
  Creates `org.apache.lucene.analysis.WhitespaceTokenizer`.
  
  Creates tokens of characters separated by splitting on whitespace.
  
  ==== solr.LowerCaseTokenizerFactory ====
- 
  Creates `org.apache.lucene.analysis.LowerCaseTokenizer`.
  
  Creates tokens by lowercasing all letters and dropping non-letters.
  
+  . Example: `"I can't" ==> "i", "can", "t"`
  
  <<Anchor(StandardTokenizer)>>
+ 
  ==== solr.StandardTokenizerFactory ====
- 
  Creates `org.apache.lucene.analysis.standard.StandardTokenizer`.
  
  A good general purpose tokenizer that strips many extraneous characters and sets token types
to meaningful values.  Token types are only useful for subsequent token filters that are type-aware.
 The !StandardFilter is currently the only Lucene filter that utilizes token types.
  
 Some token types are number, alphanumeric, email, acronym, URL, etc.
  
+  . Example: `"I.B.M. cat's can't" ==> ACRONYM: "I.B.M.", APOSTROPHE:"cat's", APOSTROPHE:"can't"`
  
  <<Anchor(HTMLStripWhitespaceTokenizer)>>
+ 
  ==== solr.HTMLStripWhitespaceTokenizerFactory ====
- 
  Strips HTML from the input stream and passes the result to a !WhitespaceTokenizer.
  
  See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping.
  
  ==== solr.HTMLStripStandardTokenizerFactory ====
- 
  Strips HTML from the input stream and passes the result to a !StandardTokenizer.
  
  See {{{solr.HTMLStripCharFilterFactory}}} for details on HTML stripping.
  
  ==== solr.PatternTokenizerFactory ====
- 
  Breaks text at the specified regular expression pattern.
  
+ For example, suppose you have a list of terms delimited by a semicolon and zero or more spaces:
`mice; kittens; dogs`.
  
  {{{
     <fieldType name="semicolonDelimited" class="solr.TextField">
@@ -198, +185 @@

        </analyzer>
     </fieldType>
  }}}
- 
  See the javadoc for details.
  
  === TokenFilterFactories ===
- 
  <<Anchor(StandardFilter)>>
+ 
  ==== solr.StandardFilterFactory ====
- 
  Creates `org.apache.lucene.analysis.standard.StandardFilter`.
  
  Removes dots from acronyms and 's from the end of tokens. Works only on typed tokens, i.e.,
those produced by !StandardTokenizer or equivalent.
  
+  . Example of !StandardTokenizer followed by !StandardFilter:
+   . `"I.B.M. cat's can't" ==> "IBM", "cat", "can't"`
  
  <<Anchor(LowerCaseFilter)>>
+ 
  ==== solr.LowerCaseFilterFactory ====
- 
  Creates `org.apache.lucene.analysis.LowerCaseFilter`.
  
  Lowercases the letters in each token. Leaves non-letter tokens alone.
  
+  . Example: `"I.B.M.", "Solr" ==> "i.b.m.", "solr"`.
- 
  
  <<Anchor(TrimFilter)>>
+ 
  ==== solr.TrimFilterFactory ====
- 
  <!> [[Solr1.2]]
  
  Creates `org.apache.solr.analysis.TrimFilter`.
  
  Trims whitespace at either end of a token.
  
+  . Example: `" Kittens!   ", "Duck" ==> "Kittens!", "Duck"`.
  
  Optionally, the "updateOffsets" attribute will update the start and end position offsets.
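 
 A minimal sketch of the filter in an analyzer chain (!KeywordTokenizerFactory is used here
because it preserves the surrounding whitespace that !TrimFilter then removes):
 
 {{{
 <fieldtype name="trimmed" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    </analyzer>
 </fieldtype>
 }}}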
  
- 
  <<Anchor(StopFilter)>>
+ 
  ==== solr.StopFilterFactory ====
- 
  Creates `org.apache.lucene.analysis.StopFilter`.
  
  Discards common words.
  
  The default English stop words are:
+ 
  {{{
      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it",
@@ -252, +236 @@

      "t", "that", "the", "their", "then", "there", "these",
      "they", "this", "to", "was", "will", "with"
  }}}
+ A customized stop word list may be specified with the "words" attribute in the schema. Optionally,
the "ignoreCase" attribute may be used to ignore the case of tokens when comparing to the
stopword list.
  
  {{{
  <fieldtype name="teststop" class="solr.TextField">
@@ -264, +246 @@

     </analyzer>
  </fieldtype>
  }}}
- 
  <<Anchor(CommonGramsFilter)>>
+ 
  ==== solr.CommonGramsFilterFactory ====
- 
  Creates `org.apache.solr.analysis.CommonGramsFilter`. <!> [[Solr1.4]]
  
  Makes shingles (i.e. the_cat) by combining common tokens (usually the same as the stop words
list) and regular tokens.  CommonGramsFilter is useful for issuing phrase queries (i.e. "the
cat") that contain stop words.  Normally phrases containing stop words would not match their
intended target and instead, the query "the cat" would match all documents containing "cat",
which can be undesirable behavior.  Phrase query slop (e.g., "the cat"~2) will not function
as intended because common grams are indexed as shingled tokens that are adjacent to each
other (i.e. the_cat is indexed as a single term).  The CommonGramsQueryFilter converts the
phrase query "the cat" into the single term query the_cat.
  
+ A customized common word list may be specified with the "words" attribute in the schema.
Optionally, the "ignoreCase" attribute may be used to ignore the case of tokens when comparing
to the common words list.
  
  {{{
  <fieldtype name="testcommongrams" class="solr.TextField">
@@ -283, +263 @@

     </analyzer>
  </fieldtype>
  }}}
+ <<Anchor(EdgeNGramFilter)>>
  
+ ==== solr.EdgeNGramFilterFactory ====
+ 
+ By default, creates n-grams from the beginning edge of an input token.
+ 
+ With the configuration below, the string value '''Nigerian''' gets broken down into the following
terms:
+ 
+ Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
+ 
+ By default, minGramSize is 1, maxGramSize is 1, and side is "front". You can instead generate
the n-grams from the end of the token by setting "side" to "back".
+ 
+ This FilterFactory is very useful for matching leading substrings (such as prefixes) of terms
in the index at query time.
+ 
+ {{{
+ <fieldtype name="testedgengrams" class="solr.TextField">
+    <analyzer>
+      <tokenizer class="solr.LowerCaseTokenizerFactory"/>
+      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
+    </analyzer>
+ </fieldtype>
+ }}}
  <<Anchor(KeepWordFilter)>>
+ 
  ==== solr.KeepWordFilterFactory ====
- 
  Creates `org.apache.solr.analysis.KeepWordFilter`. <!> [[Solr1.3]]
  
 Keeps only the words that appear on a list.  This is the inverse of the behavior of StopFilterFactory.
 The word file format is identical.
@@ -298, +299 @@

     </analyzer>
  </fieldtype>
  }}}
- 
- 
  <<Anchor(LengthFilter)>>
+ 
  ==== solr.LengthFilterFactory ====
- 
  Creates `solr.LengthFilter`.
  
 Filters out those tokens ''not'' having length min through max inclusive.
+ 
  {{{
  <fieldtype name="lengthfilt" class="solr.TextField">
    <analyzer>
@@ -314, +314 @@

    </analyzer>
  </fieldtype>
  }}}
- 
  <<Anchor(WordDelimiterFilter)>>
+ 
  ==== solr.WordDelimiterFilterFactory ====
- 
  Creates `solr.analysis.WordDelimiterFilter`.
  
+ Splits words into subwords and performs optional transformations on subword groups. Words
are split into subwords with the following rules:
+ 
   * split on intra-word delimiters (by default, all non alpha-numeric characters).
+   * `"Wi-Fi" -> "Wi", "Fi"`
   * split on case transitions
+   * `"PowerShot" -> "Power", "Shot"`
   * split on letter-number transitions
+   * `"SD500" -> "SD", "500"`
   * leading and trailing intra-word delimiters on each subword are ignored
+   * `"//hello---there, 'dude'" -> "hello", "there", "dude"`
   * trailing "'s" are removed for each subword
+   * `"O'Neil's" -> "O", "Neil"`
+    * Note: this step isn't performed in a separate filter because of possible subword combinations.
  
  Splitting is affected by the following parameter:
+ 
   * '''splitOnCaseChange="1"''' causes lowercase => uppercase transitions to generate
a new part [Solr 1.3]:
+   * `"PowerShot" => "Power" "Shot"`
+   * `"TransAM" => "Trans" "AM"`
   * '''splitOnNumerics="1"''' causes alphabet => number transitions to generate a new
part [Solr 1.3]:
+   * `"j2se" => "j" "2" "se"`
   * '''stemEnglishPossessive="1"''' causes trailing "'s" to be removed for each subword.
+   * `"Doug's" => "Doug"`
  
  Note that this is the default behaviour in all released versions of Solr.
  
  There are also a number of parameters that affect what tokens are present in the final output
and if subwords are combined:
+ 
   * '''generateWordParts="1"''' causes parts of words to be generated:
+   * `"PowerShot" => "Power" "Shot"` (if `splitOnCaseChange=1`)
+   * `"Power-Shot" => "Power" "Shot"`
   * '''generateNumberParts="1"''' causes number subwords to be generated:
+   * `"500-42" => "500" "42"`
   * '''catenateWords="1"''' causes maximum runs of word parts to be catenated:
+   * `"wi-fi" => "wifi"`
   * '''catenateNumbers="1"''' causes maximum runs of number parts to be catenated:
+   * `"500-42" => "50042"`
   * '''catenateAll="1"''' causes all subword parts to be catenated:
+   * `"wi-fi-4000" => "wifi4000"`
   * '''preserveOriginal="1"''' causes the original token to be indexed without modifications
(in addition to the tokens produced due to other options)
  
  These parameters may be combined in any way.
+ 
   * Example of generateWordParts="1" and  catenateWords="1":
+   * `"PowerShot" -> 0:"Power", 1:"Shot" 1:"PowerShot"` <<BR>> (where 0,1,1
are token positions)
+   * `"A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"`
+   * `"Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL",
3:"500" 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"`
  
  One use for !WordDelimiterFilter is to help match words with [[SolrRelevancyCookbook#IntraWordDelimiters|different
delimiters]].  One way of doing so is to specify `generateWordParts="1" catenateWords="1"`
in the analyzer used for indexing, and `generateWordParts="1"` in the analyzer used for querying.
 Given that the current !StandardTokenizer immediately removes many intra-word delimiters,
it is recommended that this filter be used after a tokenizer that leaves them in place (such
as !WhitespaceTokenizer).
  
@@ -399, +401 @@

        </analyzer>
      </fieldtype>
  }}}
- 
  <<Anchor(SynonymFilter)>>
+ 
  ==== solr.SynonymFilterFactory ====
- 
  Creates `SynonymFilter`.
  
  Matches strings of tokens and replaces them with other strings of tokens.
@@ -412, +413 @@

   * If '''expand''' is true, a synonym will be expanded to all equivalent synonyms.  If it
is false, all equivalent synonyms will be reduced to the first in the list.
  
  Example usage in schema:
+ 
  {{{
      <fieldtype name="syn" class="solr.TextField">
        <analyzer>
@@ -420, +422 @@

        </analyzer>
      </fieldtype>
  }}}
- 
  Synonym file format:
+ 
  {{{
  # blank lines and lines starting with pound are comments.
  
@@ -451, +453 @@

  foo => baz
  #is equivalent to
  foo => foo bar, baz
- 
  }}}
- 
 Keep in mind that while the !SynonymFilter will happily work with synonyms containing multiple
words (i.e., "`sea biscuit, sea biscit, seabiscuit`"), the recommended approach for dealing
with synonyms like this is to expand the synonym when indexing.  This is because there are
two potential issues that can arise at query time:
  
  1. The Lucene !QueryParser tokenizes on white space before giving any text to the Analyzer,
so if a person searches for the words `sea biscit` the analyzer will be given the words "sea"
and "biscit" separately, and will not know that they match a synonym.
@@ -461, +461 @@

  
  Even when you aren't worried about multi-word synonyms, idf differences still make index
time synonyms a good idea. Consider the following scenario:
  
+  * An index with a "text" field, which at query time uses the !SynonymFilter with the synonym
`TV, Television` and `expand="true"`
+  * Many thousands of documents containing the term "text:TV"
+  * A few hundred documents containing the term "text:Television"
  
 A query for `text:TV` will expand into `(text:TV text:Television)` and the lower docFreq
for `text:Television` will give the documents that match "Television" a much higher score
than docs that comparably match "TV" -- which may be somewhat counterintuitive to the client.
  Index time expansion (or reduction) will result in the same idf for all documents regardless
of which term the original text contained.
  
  <<Anchor(RemoveDuplicatesTokenFilter)>>
+ 
  ==== solr.RemoveDuplicatesTokenFilterFactory ====
- 
  Creates `org.apache.solr.analysis.RemoveDuplicatesTokenFilter`.
  
 Filters out any tokens which are at the same logical position in the tokenstream as a previous
token with the same text.  This can arise in a number of ways depending on what the "up stream"
token filters are -- notably when stemming synonyms with similar roots.  It is useful to remove
the duplicates to prevent `idf` inflation at index time, or `tf` inflation (in a !MultiPhraseQuery)
at query time.
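 
 As a sketch, a query analyzer where duplicates could be produced (synonym expansion followed
by stemming) and are then removed (the synonyms file name is just an example):
 
 {{{
 <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 }}}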
  
  <<Anchor(ISOLatin1AccentFilter)>>
+ 
  ==== solr.ISOLatin1AccentFilterFactory ====
- 
  Creates `org.apache.lucene.analysis.ISOLatin1AccentFilter`.
  
  Replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented
equivalent. Note that this is deprecated in favor of !ASCIIFoldingFilterFactory.
  
  <<Anchor(ASCIIFoldingFilterFactory)>>
+ 
  ==== solr.ASCIIFoldingFilterFactory ====
- 
  Creates `org.apache.lucene.analysis.ASCIIFoldingFilter`.
  
  Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first
127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one
exists.
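 
 A minimal sketch of how the filter might appear in an analyzer chain (it takes no arguments):
 
 {{{
 <filter class="solr.ASCIIFoldingFilterFactory"/>
 }}}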
@@ -491, +491 @@

  See the [[http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/analysis/ASCIIFoldingFilter.html|ASCIIFoldingFilter
Javadocs]] for more details.
  
  <<Anchor(PhoneticFilterFactory)>>
+ 
  ==== solr.PhoneticFilterFactory ====
- 
  <!> [[Solr1.2]]
  
  Creates `org.apache.solr.analysis.PhoneticFilter`.
  
  Uses [[http://jakarta.apache.org/commons/codec/|commons codec]] to generate phonetically
similar tokens.  This currently supports [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/package-summary.html|four
methods]].
- 
+ ||'''arg''' ||'''value''' ||
+ ||encoder ||one of: [[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/DoubleMetaphone.html|DoubleMetaphone]],
[[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/Metaphone.html|Metaphone]],
[[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/Soundex.html|Soundex]],
[[http://jakarta.apache.org/commons/codec/api-release/org/apache/commons/codec/language/RefinedSoundex.html|RefinedSoundex]]
||
+ ||inject ||true/false -- true will add tokens to the stream, false will replace the existing
token ||
+ ||maxCodeLength ||integer -- sets the maximum length of the code to be generated. Supported
only for Metaphone and !DoubleMetaphone encodings ||
  
  {{{
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
  }}}
- 
- 
  <<Anchor(ShingleFilterFactory)>>
+ 
  ==== solr.ShingleFilterFactory ====
- 
  <!> [[Solr1.3]]
  
  Creates [[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/shingle/ShingleFilter.html|org.apache.lucene.analysis.shingle.ShingleFilter]].
@@ -519, +519 @@

  A ShingleFilter constructs shingles (token n-grams) from a token stream. In other words,
it creates combinations of tokens as a single token.
  
  For example, the sentence "please divide this sentence into shingles" might be tokenized
into shingles "please divide", "divide this", "this sentence", "sentence into", and "into
shingles".
- 
- 
+ ||'''arg''' ||'''default value''' ||'''note''' ||
+ ||maxShingleSize ||2 || ||
+ ||minShingleSize ||2 || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-1740|SOLR-1740]]
||
+ ||outputUnigrams ||true || ||
+ ||tokenSeparator ||" " || <!> [[Solr3.1]] -- [[https://issues.apache.org/jira/browse/SOLR-1740|SOLR-1740]]
||
  
  {{{
    <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
  }}}
- 
- 
- 
  <<Anchor(PositionFilterFactory)>>
+ 
  ==== solr.PositionFilterFactory ====
- 
  <!> [[Solr1.4]]
  
  Creates [[http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/contrib-analyzers/org/apache/lucene/analysis/position/PositionFilter.html|org.apache.lucene.analysis.position.PositionFilter]].
@@ -543, +541 @@

  A PositionFilter manipulates the position of tokens in the stream.
  
  Sets the positionIncrement of all tokens to the configured "positionIncrement" value, except
the first token returned, which retains its original positionIncrement value.
- 
- || '''arg''' || '''value''' ||
+ ||'''arg''' ||'''value''' ||
- || positionIncrement || default 0 ||
+ ||positionIncrement ||default 0 ||
+ 
+ 
+ 
  
  {{{
    <filter class="solr.PositionFilterFactory" />
  }}}
- 
  PositionFilter can be used with a query Analyzer to prevent expensive Phrase and MultiPhraseQueries.
When QueryParser parses a query, it first divides text on whitespace, and then Analyzes each
whitespace token. Some TokenStreams such as StandardTokenizer or WordDelimiterFilter may divide
one of these whitespace-separated tokens into multiple tokens.
  
  The QueryParser will turn "multiple tokens" into a Phrase or MultiPhraseQuery, but "multiple
tokens at the same position with only a position count of 1" is treated as a special case.
You can use PositionFilter at the end of your QueryAnalyzer to force any subsequent tokens
after the first one to have a position increment of zero, to trigger this case.
  
  For example, by default a query of "Wi-Fi" with StandardTokenizer will create a PhraseQuery:
+ 
- {{{ 
+ {{{
  field:"Wi Fi"
  }}}
  If you instead wrap the StandardTokenizer with PositionFilter, the "Fi" will have a position
increment of zero, creating a BooleanQuery:
+ 
  {{{
  field:Wi field:Fi
  }}}
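  A query analyzer that wraps StandardTokenizer with PositionFilter, as described above, could
be sketched like this (the rest of the field type definition is omitted):
  
  {{{
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.PositionFilterFactory"/>
    </analyzer>
  }}}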
- 
- Another example is when exact matching hits are wanted for _any_ shingle within the query.
(This was done at http://sesam.no to replace three proprietary 'FAST Query-Matching servers'
with two open sourced Solr indexes, background reading in [[http://sesat.no/howto-solr-query-evaluation.html|sesat]]
and on the [[http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746|mailing list]]).
+ Another example is when exact matching hits are wanted for _any_ shingle within the query.
(This was done at http://sesam.no to replace three proprietary 'FAST Query-Matching servers'
with two open sourced Solr indexes, background reading in [[http://sesat.no/howto-solr-query-evaluation.html|sesat]]
and on the [[http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746|mailing list]]).
All words and shingles in the query needed to be placed at the same position, so that all
shingles would be treated as synonyms of each other.
- It was needed that in the query all words and shingles to be placed at the same position,
so that all shingles to be treated as synonyms of each other.
  
- With only the ShingleFilter the shingles generated are synonyms only to the first term in
each shingle group.
+ With only the ShingleFilter, the shingles generated are synonyms only of the first term in
each shingle group. For example the query "abcd efgh ijkl" results in a query like:
- For example the query "abcd efgh ijkl" results in a query like:
+ 
-   ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" efgh ijkl") ("ijkl")
+  . ("abcd" "abcd efgh" "abcd efgh ijkl") ("efgh" "efgh ijkl") ("ijkl")
+ 
  where "abcd efgh" and "abcd efgh ijkl" are synonyms of "abcd", and "efgh ijkl" is a synonym
of "efgh".
  
  ShingleFilter does not offer a way to alter this behaviour.
  
- Using the PositionFilter in combination makes it possible to make all shingles synonyms
of each other.
+ Using the PositionFilter in combination makes it possible to make all shingles synonyms
of each other. Such a configuration could look like:
- Such a configuration could look like:
+ 
  {{{
     <fieldType name="shingleString" class="solr.TextField" positionIncrementGap="100"
omitNorms="true">
        <analyzer type="index">
@@ -588, +588 @@

        </analyzer>
      </fieldType>
  }}}
- 
  <<Anchor(ReversedWildcardFilterFactory)>>
+ 
  ==== solr.ReversedWildcardFilterFactory ====
  <!> [[Solr1.4]]
  
@@ -598, +598 @@

  See the [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/ReversedWildcardFilterFactory.html|javadoc]]
for more details, or the [[http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.xml?view=markup|example
schema]].
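  A typical index-time usage might look like the following sketch (the attribute names follow
the factory's javadoc and the example schema; the values here are illustrative):
  
  {{{
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  }}}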
  
  <<Anchor(CollationKeyFilterFactory)>>
+ 
  ==== solr.CollationKeyFilterFactory ====
  <!> [[Solr1.5]]
  
  A filter that lets one specify:
+ 
   1. A system collator associated with a locale, or
-  2. A collator based on custom rules
+  1. A collator based on custom rules
  
+ This can be used for changing the sort order for non-English languages as well as to modify
the collation sequence for certain languages. You must use the same !CollationKeyFilter at
both index time and query time for correct results. Also, the JVM vendor and version (including
patch version) of the slave should be exactly the same as that of the master (or indexer) for
consistent results.
- This can be used for changing sort order for non-english languages as well as to modify
the collation sequence for certain languages. You must use the same 
- !CollationKeyFilter at both index-time and query-time for correct results. Also, the JVM
vendor, version (including patch version) of the slave should be exactly same as the master
(or indexer) for consistent results.
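  For example, a field type for locale-aware sorting could be sketched as follows (the field
type name and locale are illustrative):
  
  {{{
    <fieldType name="sort_de" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.CollationKeyFilterFactory" language="de" strength="primary"/>
      </analyzer>
    </fieldType>
  }}}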
  
  Also see
+ 
   1. [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/CollationKeyFilterFactory.html|Javadocs]]
-  2. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/package-summary.html|Lucene
2.9.1 contrib-collation documentation]]
+  1. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/package-summary.html|Lucene
2.9.1 contrib-collation documentation]]
-  3. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/CollationKeyFilter.html|Lucene's
CollationKeyFilter javadocs]]
+  1. [[http://lucene.apache.org/java/2_9_1/api/contrib-collation/org/apache/lucene/collation/CollationKeyFilter.html|Lucene's
CollationKeyFilter javadocs]]
-  4. UnicodeCollation
+  1. UnicodeCollation
  
