lucene-commits mailing list archives

From "Steve Rowe (Confluence)" <conflue...@apache.org>
Subject [CONF] Apache Solr Reference Guide > What Is An Analyzer?
Date Thu, 26 Sep 2013 15:04:00 GMT
Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: What Is An Analyzer? (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604227)

Change Comment:
---------------------------------------------------------------------
Reverted code block formatting 

Edited by Steve Rowe:
---------------------------------------------------------------------
An analyzer examines the text of fields and generates a token stream. An analyzer is specified
as a child of the {{<fieldType>}} element in the {{schema.xml}} configuration file, which
can be found in the {{solr/conf}} directory, or wherever {{solrconfig.xml}} is located.

In normal usage, only fields of type {{solr.TextField}} will specify an analyzer.  The simplest
way to configure an analyzer is with a single {{<analyzer>}} element whose class attribute
is a fully qualified Java class name. The named class must derive from {{org.apache.lucene.analysis.Analyzer}}.
For example:

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="nametext" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.core.WhitespaceAnalyzer"/>
</fieldType>
{code}

In this case a single class, {{WhitespaceAnalyzer}}, is responsible for analyzing the content
of the named text field and emitting the corresponding tokens. For simple cases, such as plain
English prose, a single analyzer class like this may be sufficient.  But it's often necessary
to do more complex analysis of the field content.

Even the most complex analysis requirements can usually be decomposed into a series of discrete,
relatively simple processing steps. As you will soon discover, the Solr distribution comes
with a large selection of tokenizers and filters that covers most scenarios you are likely
to encounter. Setting up an analyzer chain is straightforward: you specify an {{<analyzer>}}
element (with no class attribute) whose child elements name the factory classes for the tokenizer
and filters to use, in the order you want them to run.

For example:

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="nametext" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"/>
    </analyzer>
</fieldType>
{code}

Note that classes in the {{org.apache.solr.analysis}} package may be referred to here with
the shorthand {{solr.}} prefix.

In this case, no Analyzer class was specified on the {{<analyzer>}} element. Rather,
a sequence of more specialized classes is wired together, and collectively they act as the Analyzer
for the field. The text of the field is passed to the first item in the list ({{solr.StandardTokenizerFactory}}),
and the tokens that emerge from the last one ({{solr.EnglishPorterFilterFactory}}) are the
terms that are used for indexing or querying any fields that use the "nametext" {{fieldType}}.
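
As a minimal sketch of how a field picks up this analysis chain, a field declaration in {{schema.xml}}
refers to the field type by name through its {{type}} attribute. The field name {{author}} and
the {{indexed}} and {{stored}} settings below are assumptions for illustration only.

{code:xml|borderStyle=solid|borderColor=#666666}
<!-- Hypothetical field: the name "author" and its attribute values are illustrative -->
<field name="author" type="nametext" indexed="true" stored="true"/>
{code}

Any other field declared with {{type="nametext"}} would be analyzed by the same chain.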

h2. Analysis Phases

Analysis takes place in two contexts. At index time, when a field is being created, the token
stream that results from analysis is added to an index and defines the set of terms (including
positions, sizes, and so on) for the field. At query time, the values being searched for are
analyzed and the terms that result are matched against those that are stored in the field's
index.

In many cases, the same analysis should be applied to both phases. This is desirable when,
for example, you want to query for exact string matches, possibly made case-insensitive.
In other cases, you may want to apply slightly different analysis steps during indexing than
those used at query time.

If you provide a simple {{<analyzer>}} definition for a field type, as in the examples
above, then it will be used for both indexing and queries. If you want distinct analyzers
for each phase, you may include two {{<analyzer>}} definitions distinguished with a
type attribute. For example:

{code:xml|borderStyle=solid|borderColor=#666666}
<fieldType name="nametext" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeepWordFilterFactory" words="keepwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="syns.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
{code}

In this hypothetical example, at index time the text is tokenized, the tokens are converted to lowercase,
any that are not listed in {{keepwords.txt}} are discarded, and those that remain are mapped
to alternate values as defined by the synonym rules in the file {{syns.txt}}. This essentially
builds an index from a restricted set of possible values and then normalizes them to values
that may not even occur in the original text.

At query time, the only normalization that happens is to convert the query terms to lowercase.
The filtering and mapping steps that occur at index time are not applied to the query terms.
In this example, queries must therefore be very precise, using only the normalized terms that
were stored at index time.
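
For contrast, an asymmetric setup can also put the extra step on the query side. The following
is a minimal sketch, not a configuration taken from this guide: the field type name {{nametext_qsyn}}
and the file name {{synonyms.txt}} are assumptions chosen for illustration, and the synonym
file would need to exist in the configuration directory.

{code:xml|borderStyle=solid|borderColor=#666666}
<!-- Sketch only: "nametext_qsyn" and "synonyms.txt" are hypothetical names -->
<fieldType name="nametext_qsyn" class="solr.TextField">
    <!-- Index time: tokenize and lowercase only -->
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <!-- Query time: same steps, then expand each query term with its synonyms -->
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    </analyzer>
</fieldType>
{code}

With this arrangement the index holds only the lowercased original tokens, and the synonym
rules are applied to the query terms, so the synonym file can be changed without reindexing.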

{scrollbar}

