lucene-commits mailing list archives

From "Cassandra Targett (Confluence)" <conflue...@apache.org>
Subject [CONF] Apache Solr Reference Guide > Tokenizers
Date Fri, 27 Sep 2013 19:43:00 GMT
Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Tokenizers (https://cwiki.apache.org/confluence/display/solr/Tokenizers)

{section}
{column:width=70%}

You configure the tokenizer for a text field type in {{schema.xml}} with a {{<tokenizer>}}
element, as a child of {{<analyzer>}}:

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="text" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
    </analyzer>
</fieldType>
{code}

The class attribute names a factory class that will instantiate a tokenizer object when needed.
Tokenizer factory classes implement the {{org.apache.solr.analysis.TokenizerFactory}} interface.
A TokenizerFactory's {{create()}} method accepts a Reader and returns a TokenStream. When Solr
creates the tokenizer, it passes a Reader object that provides the content of the text field.

Arguments may be passed to tokenizer factories by setting attributes on the {{<tokenizer>}}
element.

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="semicolonDelimited" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="; "/>
  </analyzer>
</fieldType>
{code}

The following sections describe the tokenizer factory classes included in this release of
Solr.

For more information about Solr's tokenizers, see [http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters].
{column}
{column:width=30%}
{panel}

Tokenizers discussed in this section:

{toc}

{panel}
{column}
{section}

h2. Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token, including
Internet domain names.

* Words are split at hyphens, and the hyphen is discarded. This applies even when the word
contains a number, so "03-09" becomes the two tokens "03" and "09".

* The "@" character is among the set of token-splitting punctuation, so email addresses are
*not* preserved as single tokens.

The Standard Tokenizer supports [Unicode standard annex UAX#29|http://unicode.org/reports/tr29/#Word_Boundaries]
word boundaries with the following token types: {{<ALPHANUM>}}, {{<NUM>}}, {{<SOUTHEAST_ASIAN>}},
{{<IDEOGRAPHIC>}}, and {{<HIRAGANA>}}.

*Factory class:* {{solr.StandardTokenizerFactory}}

*Arguments:*

{{maxTokenLength}}: (integer, default 255) Solr ignores tokens that exceed the number of characters
specified by {{maxTokenLength}}.

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
{code}

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

h2. Classic Tokenizer

The Classic Tokenizer preserves the same behavior as the Standard Tokenizer of Solr versions
3.1 and previous. It does not use the [Unicode standard annex UAX#29|http://unicode.org/reports/tr29/#Word_Boundaries]
word boundary rules that the Standard Tokenizer uses. This tokenizer splits the text field
into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded,
with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token.

* Words are split at hyphens, unless there is a number in the word, in which case the token
is not split and the numbers and hyphen(s) are preserved.

* Recognizes Internet domain names and email addresses and preserves them as a single token.

*Factory class:* {{solr.ClassicTokenizerFactory}}

*Arguments:*

{{maxTokenLength}}: (integer, default 255) Solr ignores tokens that exceed the number of characters
specified by {{maxTokenLength}}.

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.ClassicTokenizerFactory"/>
</analyzer>
{code}

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

h2. Keyword Tokenizer

This tokenizer treats the entire text field as a single token.

*Factory class:* {{solr.KeywordTokenizerFactory}}

*Arguments:* None

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
{code}

*In:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

*Out:* "Please, email john.doe@foo.com by 03-09, re: m37-xq."

h2. Letter Tokenizer

This tokenizer creates tokens from strings of contiguous letters, discarding all non-letter
characters.

*Factory class:* {{solr.LetterTokenizerFactory}}

*Arguments:* None

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.LetterTokenizerFactory"/>
</analyzer>
{code}

*In:* "I can't."

*Out:* "I", "can", "t"
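
The behavior described above can be sketched in a few lines of Python. This is an illustration, not the Lucene implementation; Lucene's LetterTokenizer uses {{Character.isLetter}}, which the Unicode-aware character class below approximates.

```python
import re

def letter_tokens(text):
    # Runs of consecutive letters; every non-letter character is a
    # delimiter and is discarded. [^\W\d_] matches word characters
    # minus digits and underscore, i.e. Unicode letters.
    return re.findall(r"[^\W\d_]+", text)

print(letter_tokens("I can't."))  # ['I', 'can', 't']
```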

h2. Lower Case Tokenizer

Tokenizes the input stream by delimiting at non-letters and then converting all letters to
lowercase. Whitespace and non-letters are discarded.

*Factory class:* {{solr.LowerCaseTokenizerFactory}}

*Arguments:* None

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
{code}

*In:* "I just *LOVE* my iPhone\!"

*Out:* "i", "just", "love", "my", "iphone"

h2. N-Gram Tokenizer

Reads the field text and generates n-gram tokens of sizes in the given range.

*Factory class:* {{solr.NGramTokenizerFactory}}

*Arguments:*

{{minGramSize}}: (integer, default 1) The minimum n-gram size, must be > 0.

{{maxGramSize}}: (integer, default 2) The maximum n-gram size, must be >= {{minGramSize}}.

*Example:*

Default behavior. Note that this tokenizer operates over the whole field. It does not break
the field at whitespace, so the space character appears in the n-grams.

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory"/>
</analyzer>
{code}

*In:* "hey man"

*Out:* "h", "e", "y", " ", "m", "a", "n", "he", "ey", "y ", " m", "ma", "an"

*Example:*

With an n-gram size range of 4 to 5:

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="4" maxGramSize="5"/>
</analyzer>
{code}

*In:* "bicycle"

*Out:* "bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle"
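
The n-gram generation itself can be sketched in Python. This is a simplified illustration, not the Lucene implementation, and the order in which Lucene emits the grams may differ between versions; the set of grams is the same.

```python
def ngrams(text, min_size=1, max_size=2):
    # Every character n-gram of each size from min_size to max_size,
    # smallest size first, scanning left to right. The whole field is
    # one string, so spaces appear inside grams too.
    return [text[i:i + n]
            for n in range(min_size, max_size + 1)
            for i in range(len(text) - n + 1)]

print(ngrams("hey man"))
# ['h', 'e', 'y', ' ', 'm', 'a', 'n', 'he', 'ey', 'y ', ' m', 'ma', 'an']
```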

h2. Edge N-Gram Tokenizer

Reads the field text and generates edge n-gram tokens of sizes in the given range.

*Factory class:* {{solr.EdgeNGramTokenizerFactory}}

*Arguments:*

{{minGramSize}}: (integer, default is 1) The minimum n-gram size, must be > 0.

{{maxGramSize}}: (integer, default is 1) The maximum n-gram size, must be >= {{minGramSize}}.

{{side}}: ("front" or "back", default is "front") Whether to compute the n-grams from the
beginning (front) of the text or from the end (back).

*Example:*

Default behavior (min and max default to 1):

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory"/>
</analyzer>
{code}

*In:* "babaloo"

*Out:* "b"

*Example:*

Edge n-gram range of 2 to 5

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5"/>
</analyzer>
{code}

*In:* "babaloo"

*Out:* "ba", "bab", "baba", "babal"

*Example:*

Edge n-gram range of 2 to 5, from the back side:

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.EdgeNGramTokenizerFactory" minGramSize="2" maxGramSize="5" side="back"/>
</analyzer>
{code}

*In:* "babaloo"

*Out:* "oo", "loo", "aloo", "baloo"
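
Edge n-grams are simply prefixes (or suffixes) of the field, one per size in the configured range. A minimal Python sketch of that behavior, not the Lucene implementation:

```python
def edge_ngrams(text, min_size=1, max_size=1, side="front"):
    # One gram per size: prefixes for side="front",
    # suffixes for side="back".
    sizes = range(min_size, max_size + 1)
    if side == "front":
        return [text[:n] for n in sizes]
    return [text[-n:] for n in sizes]

print(edge_ngrams("babaloo", 2, 5))               # ['ba', 'bab', 'baba', 'babal']
print(edge_ngrams("babaloo", 2, 5, side="back"))  # ['oo', 'loo', 'aloo', 'baloo']
```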

h2. ICU Tokenizer

This tokenizer processes multilingual text and tokenizes it appropriately based on its script
attribute.

You can customize this tokenizer's behavior by specifying [per-script rule files|http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules].
To add per-script rules, add a {{rulefiles}} argument, which should contain a comma-separated
list of {{code:rulefile}} pairs in the following format: four-letter ISO 15924 script code,
followed by a colon, then a resource path. For example, to specify rules for Latin (script
code "Latn") and Cyrillic (script code "Cyrl"), you would enter {{Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi}}.

*Factory class:* {{solr.ICUTokenizerFactory}}

*Arguments:*

{{rulefiles}}: a comma-separated list of {{code:rulefile}} pairs in the following format: four-letter
ISO 15924 script code, followed by a colon, then a resource path.

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
</analyzer>
{code}

h2. Path Hierarchy Tokenizer

This tokenizer creates tokens from file path hierarchies, emitting one token for each ancestor
of the input path.

*Factory class:* {{solr.PathHierarchyTokenizerFactory}}

*Arguments:*

{{delimiter}}: (character, no default) Specifies the character that delimits path elements in
the input. This can be useful for working with backslash-delimited paths.

{{replace}}: (character, no default) Specifies the delimiter character Solr uses between path
elements in the tokenized output.

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="\" replace="/"/>
  </analyzer>
</fieldType>
{code}

*In:* "c:\usr\local\apache"

*Out:* "c:", "c:/usr", "c:/usr/local", "c:/usr/local/apache"
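
The hierarchy expansion can be sketched in Python as splitting on the input delimiter and re-joining progressively longer prefixes with the replacement delimiter. A rough illustration, not the Lucene implementation:

```python
def path_tokens(path, delimiter="/", replace=None):
    # One token per ancestor path: split on `delimiter`, then re-join
    # prefixes of increasing length using `replace` (defaults to the
    # input delimiter).
    replace = delimiter if replace is None else replace
    parts = path.split(delimiter)
    return [replace.join(parts[:i + 1]) for i in range(len(parts))]

print(path_tokens(r"c:\usr\local\apache", delimiter="\\", replace="/"))
# ['c:', 'c:/usr', 'c:/usr/local', 'c:/usr/local/apache']
```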

h2. Regular Expression Pattern Tokenizer

This tokenizer uses a Java regular expression to break the input text stream into tokens.
The expression provided by the pattern argument can be interpreted either as a delimiter that
separates tokens, or to match patterns that should be extracted from the text as tokens.

See the Javadocs for [{{java.util.regex.Pattern}}|http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html]
for more information on Java regular expression syntax.

*Factory class:* {{solr.PatternTokenizerFactory}}

*Arguments:*

{{pattern}}: (Required) The regular expression, as defined in {{java.util.regex.Pattern}}.

{{group}}: (Optional, default \-1) Specifies which regex group to extract as the token(s).
The value \-1 means the regex should be treated as a delimiter that separates tokens. Non-negative
group numbers (>= 0) indicate that character sequences matching that regex group should
be converted to tokens. Group zero refers to the entire regex, groups greater than zero refer
to parenthesized sub-expressions of the regex, counted from left to right.

*Example:*

A comma-separated list. Tokens are separated by a sequence of zero or more spaces, a comma,
and zero or more spaces.

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
</analyzer>
{code}

*In:* "fee,fie, foe ,    fum,   foo"

*Out:* "fee", "fie", "foe", "fum", "foo"
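
Delimiter mode (the default, {{group}} = \-1) behaves like a regex split. Java and Python regex syntax agree on this pattern, so the behavior can be sketched in Python:

```python
import re

def pattern_split(text, pattern):
    # In delimiter mode the pattern marks the separators; the text
    # between matches becomes the tokens. Empty tokens are dropped.
    return [t for t in re.split(pattern, text) if t]

print(pattern_split("fee,fie, foe ,    fum,   foo", r"\s*,\s*"))
# ['fee', 'fie', 'foe', 'fum', 'foo']
```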

*Example:*

Extract simple, capitalized words.  A sequence of at least one capital letter followed by
zero or more letters of either case is extracted as a token.

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="\[A-Z\]\[A-Za-z\]*" group="0"/>
</analyzer>
{code}

*In:* "Hello. My name is Inigo Montoya. You killed my father. Prepare to die."

*Out:* "Hello", "My", "Inigo", "Montoya", "You", "Prepare"

*Example:*

Extract part numbers that are preceded by "SKU", "Part" or "Part Number", case-sensitive,
with an optional colon separator. Part numbers must be all numeric digits, with an optional
hyphen. Regex capture groups are numbered by counting left parentheses from left to right.
Group 3 is the subexpression "\[0-9-\]+", which matches one or more digits or hyphens.

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern="(SKU|Part(\sNumber)?):?\s(\[0-9-\]+)"
group="3"/>
</analyzer>
{code}

*In:* "SKU: 1234, Part Number 5678, Part: 126-987"

*Out:* "1234", "5678", "126-987"
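
Group-extraction mode can likewise be sketched in Python, since {{re}} shares the relevant syntax with {{java.util.regex.Pattern}}. This is an illustration of the behavior, not the Lucene implementation:

```python
import re

def pattern_extract(text, pattern, group=0):
    # With group >= 0, each regex match yields one token: the text
    # captured by that group. Group 0 is the entire match.
    return [m.group(group) for m in re.finditer(pattern, text)]

print(pattern_extract("SKU: 1234, Part Number 5678, Part: 126-987",
                      r"(SKU|Part(\sNumber)?):?\s([0-9-]+)", group=3))
# ['1234', '5678', '126-987']
```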

h2. Type Tokenizer

This tokenizer filters tokens by their type, using either an exclude or an include list. (As
the example below shows, it is configured as a {{<filter>}} element.)

*Factory class:* {{solr.TypeTokenFilterFactory}}

*Arguments:*

{{types}}: Defines the location of a file of types to filter.

{{enablePositionIncrements}}: If *true*, position increments are preserved when tokens are
removed, leaving gaps at the filtered positions.

{{useWhiteList}}: If *true*, the file defined in {{types}} is used as an include list; if
*false*, it is used as an exclude list.

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
   <filter class="solr.TypeTokenFilterFactory" types="stoptypes.txt"
      enablePositionIncrements="true" useWhiteList="false"/>
</analyzer>
{code}

h2. UAX29 URL Email Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters.
Delimiter characters are discarded, with the following exceptions:

* Periods (dots) that are not followed by whitespace are kept as part of the token.

* Words are split at hyphens, unless there is a number in the word, in which case the token
is not split and the numbers and hyphen(s) are preserved.

* Recognizes top-level Internet domain names (validated against the white list in the [IANA
Root Zone Database|http://www.internic.net/zones/root.zone] when the tokenizer was generated);
email addresses; {{file://}}, {{http(s)://}}, and {{ftp://}} URLs; and IPv4 and IPv6
addresses, preserving each as a single token.

The UAX29 URL Email Tokenizer supports [Unicode standard annex UAX#29|http://unicode.org/reports/tr29/#Word_Boundaries]
word boundaries with the following token types: {{<ALPHANUM>}}, {{<NUM>}}, {{<URL>}},
{{<EMAIL>}}, {{<SOUTHEAST_ASIAN>}}, {{<IDEOGRAPHIC>}}, and {{<HIRAGANA>}}.

*Factory class:* {{solr.UAX29URLEmailTokenizerFactory}}

*Arguments:*

{{maxTokenLength}}: (integer, default 255) Solr ignores tokens that exceed the number of characters
specified by {{maxTokenLength}}.

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
</analyzer>
{code}

*In:* "Visit {nolink:http://accarol.com/contact.htm?from=external&a=10} or e-mail bob.cratchet@accarol.com"

*Out:* "Visit", "http://accarol.com/contact.htm?from=external&a=10", "or", "e", "mail", "bob.cratchet@accarol.com"

h2. White Space Tokenizer

Simple tokenizer that splits the text stream on whitespace and returns sequences of non-whitespace
characters as tokens. Note that any punctuation _will_ be included in the tokenization.

*Factory class:* {{solr.WhitespaceTokenizerFactory}}

*Arguments:* None

*Example:*

{code:xml|borderStyle=solid|borderColor=#666666}
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
{code}

*In:* "To be, or what?"

*Out:* "To", "be,", "or", "what?"
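
Because the only rule is "split on whitespace", this tokenizer's behavior matches a plain string split, as this Python sketch shows:

```python
def whitespace_tokens(text):
    # str.split() with no argument splits on runs of whitespace;
    # punctuation stays attached to the neighboring token.
    return text.split()

print(whitespace_tokens("To be, or what?"))  # ['To', 'be,', 'or', 'what?']
```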

h2. Related Topics

* [TokenizerFactories|http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#TokenizerFactories]


