lucene-commits mailing list archives

From "Hoss Man (Confluence)" <conflue...@apache.org>
Subject [CONF] Apache Solr Reference Guide > Spell Checking
Date Wed, 17 Jul 2013 23:50:00 GMT
Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Spell Checking (https://cwiki.apache.org/confluence/display/solr/Spell+Checking)

Change Comment:
---------------------------------------------------------------------
SOLR-3240 - spellcheck.collateMaxCollectDocs

Edited by Hoss Man:
---------------------------------------------------------------------
{section}
{column:width=75%}
The SpellCheck component is designed to provide inline query suggestions based on other, similar,
terms. The basis for these suggestions can be terms in a field in Solr, externally created
text files, or fields in other Lucene indexes.
{column}

{column:width=25%}
{panel}
Topics covered in this section:
{toc:maxLevel=2}
{panel}
{column}
{section}

h2. Configuring the SpellCheckComponent

h3. Define Spell Check in {{solrconfig.xml}}

The first step is to specify the source of terms in {{solrconfig.xml}}. There are three approaches
to spell checking in Solr, discussed below.

h4. IndexBasedSpellChecker

The {{IndexBasedSpellChecker}} uses a Solr index as the basis for a parallel index used for
spell checking. It requires defining a field as the basis for the index terms; a common practice
is to copy terms from some fields (such as {{title}}, {{body}}, etc.) to another field created
for spell checking. Here is a simple example of configuring {{solrconfig.xml}} with the {{IndexBasedSpellChecker}}:

{code:borderStyle=solid|borderColor=#666666}
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="classname">solr.IndexBasedSpellChecker</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="field">content</str>
      <str name="buildOnCommit">true</str>
    </lst>
</searchComponent>
{code}

The first element defines the {{searchComponent}} to use the {{solr.SpellCheckComponent}}.
The {{classname}} is the specific implementation of the SpellCheckComponent, in this case
{{solr.IndexBasedSpellChecker}}. Defining the {{classname}} is optional; if not defined, it
will default to {{IndexBasedSpellChecker}}.

The {{spellcheckIndexDir}} defines the location of the directory that holds the spellcheck
index, while the {{field}} defines the source field (defined in {{schema.xml}}) for spell
check terms. When choosing a field for the spellcheck index, it's best to avoid a heavily
processed field to get more accurate results. If the field has many word variations from processing
synonyms and/or stemming, the dictionary will be created with those variations in addition
to more valid spelling data.

Finally, _buildOnCommit_ defines whether to build the spell check index at every commit (that
is, every time new documents are added to the index). It is optional, and can be omitted if
you would rather set it to {{false}}.
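
The practice of copying terms into a dedicated spell check field, mentioned above, might look
like the following in {{schema.xml}}. This is only a sketch: the field name {{spell}} and the
type name {{textSpell}} are assumptions chosen for illustration, and the {{field}} setting in
{{solrconfig.xml}} would need to name whichever field you actually define.

{code:xml|borderStyle=solid|borderColor=#666666}
<!-- Hypothetical field type: tokenizes and lowercases only, with no stemming or synonyms -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that exists only to feed the spell checker -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>

<!-- Copy the text of the source fields into the spell check field -->
<copyField source="title" dest="spell"/>
<copyField source="body" dest="spell"/>
{code}

Keeping the analysis chain this light is one way to follow the advice about avoiding heavily
processed fields.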

h4. DirectSolrSpellChecker

The {{DirectSolrSpellChecker}} uses terms from the Solr index without building a parallel
index like the {{IndexBasedSpellChecker}}. It is considered experimental and still in development,
but is being used widely. This spell checker has the benefit of not having to be built regularly,
meaning that the terms are always up-to-date with terms in the index. Here is how this might
be configured in {{solrconfig.xml}}:

{code:borderStyle=solid|borderColor=#666666}
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
 <lst name="spellchecker">
   <str name="name">default</str>
   <str name="field">name</str>
   <str name="classname">solr.DirectSolrSpellChecker</str>
   <str name="distanceMeasure">internal</str>
   <float name="accuracy">0.5</float>
   <int name="maxEdits">2</int>
   <int name="minPrefix">1</int>
   <int name="maxInspections">5</int>
   <int name="minQueryLength">4</int>
   <float name="maxQueryFrequency">0.01</float>
   <float name="thresholdTokenFrequency">.01</float>
 </lst>
</searchComponent>
{code}

When choosing a {{field}} to query for this spell checker, you want one which has relatively
little analysis performed on it (particularly analysis such as stemming). Note that you need
to specify a field to use for the suggestions, so like the {{IndexBasedSpellChecker}}, you
may want to copy data from fields like {{title}}, {{body}}, etc., to a field dedicated to
providing spelling suggestions.

Many of the parameters relate to how this spell checker should query the index for term suggestions.
The {{distanceMeasure}} defines the metric to use during the spell check query. The value
"internal" uses the default Levenshtein metric, which is the same metric used with the other
spell checker implementations.

Because this spell checker is querying the main index, you may want to limit how often it
queries the index to be sure to avoid any performance conflicts with user queries. The {{accuracy}}
setting defines the threshold for a valid suggestion, while {{maxEdits}} defines the number
of changes to the term to allow. Since most spelling mistakes are only 1 letter off, setting
this to 1 will reduce the number of possible suggestions (the default, however, is 2); the
value can only be 1 or 2. {{minPrefix}} defines the minimum number of characters the terms
should share. Setting this to 1 means that the spelling suggestions will all start with the
same letter, for example. 

The {{maxInspections}} parameter defines the maximum number of possible matches to review
before returning results; the default is 5. {{minQueryLength}} defines how many characters
must be in the query before suggestions are provided; the default is 4. {{maxQueryFrequency}}
sets the maximum number of documents a query term may appear in and still be considered for
correction; terms that occur more often are assumed to be spelled correctly. This can be a
percentage (such as .01, or 1%) or an absolute value (such as 4). A lower threshold is better
for small indexes. Finally, {{thresholdTokenFrequency}} sets the minimum number of documents
a term must appear in to be offered as a suggestion, and can also be expressed as a
percentage or an absolute value.
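
As a rough illustration of limiting the work this spell checker does, the sketch below allows
only single-character edits and requires suggestions to share the first letter of the query
term. The definition name {{conservative}} is hypothetical, and the values are examples rather
than recommendations; parameters that are omitted keep their defaults.

{code:xml|borderStyle=solid|borderColor=#666666}
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <!-- "conservative" is a hypothetical name for this definition -->
    <str name="name">conservative</str>
    <str name="field">name</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- Allow a single edit only, which keeps the set of candidates small -->
    <int name="maxEdits">1</int>
    <!-- Suggestions must share the first letter with the query term -->
    <int name="minPrefix">1</int>
  </lst>
</searchComponent>
{code}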

h4. FileBasedSpellChecker

The {{FileBasedSpellChecker}} uses an external file as a spelling dictionary. This can be
useful if using Solr as a spelling server, or if spelling suggestions don't need to be based
on actual terms in the index. In {{solrconfig.xml}}, you would define the searchComponent
like so:

{code:borderStyle=solid|borderColor=#666666}
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
   <lst name="spellchecker">
      <str name="classname">solr.FileBasedSpellChecker</str>
      <str name="name">file</str>
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
   </lst>
</searchComponent>
{code}

The differences here are the use of the {{sourceLocation}} to define the location of the file
of terms and the use of {{characterEncoding}} to define the encoding of the terms file.

{info}
In the previous example, _name_ is used to name this specific definition of the spellchecker.
Multiple definitions can co-exist in a single {{solrconfig.xml}}, and the _name_ helps to differentiate
them when one is selected with the {{spellcheck.dictionary}} request parameter. If only defining
one spellchecker, no name is required.
{info}

h4. WordBreakSolrSpellChecker

A parallel implementation, {{WordBreakSolrSpellChecker}} offers suggestions by combining adjacent
query terms and/or breaking terms into multiple words. It is a {{SpellCheckComponent}} enhancement,
leveraging Lucene's {{WordBreakSpellChecker}}. It can detect spelling errors resulting from
misplaced whitespace without the use of shingle-based dictionaries and provides collation
support for word-break errors, including cases where the user has a mix of single-word spelling
errors and word-break errors in the same query. It also provides shard support.

Here is how it might be configured in {{solrconfig.xml}}:

{code:borderStyle=solid|borderColor=#666666}
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">lowerfilt</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>
{code}

Some of the parameters will be familiar from the discussion of the other spell checkers, such
as {{name}}, {{classname}}, and {{field}}. New for this spell checker is {{combineWords}},
which defines whether words should be combined in a dictionary search (default is true); {{breakWords}},
which defines if words should be broken during a dictionary search (default is true); and
{{maxChanges}}, an integer which defines how many times the spell checker should check collation
possibilities against the index (default is 10).

The word-break spellchecker can be configured alongside a traditional checker (for example,
{{DirectSolrSpellChecker}}). The results are combined, and collations can contain a mix of
corrections from both spellcheckers.
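
A minimal sketch of such a pairing is shown below, with both definitions placed in the same
{{searchComponent}}. Reusing the {{lowerfilt}} field for the conventional checker and naming it
{{default}} are assumptions made for illustration.

{code:xml|borderStyle=solid|borderColor=#666666}
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- Conventional checker for single-word corrections -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="field">lowerfilt</str>
  </lst>
  <!-- Word-break checker for misplaced whitespace -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">lowerfilt</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>
{code}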


h3. Add It to a Request Handler

Queries will be sent to a [RequestHandler|Query Syntax and Parsing]. If every request should
generate a suggestion, then you would add the following to the {{requestHandler}} that you
are using:

{code:borderStyle=solid|borderColor=#666666}
<str name="spellcheck">true</str>
{code}

One of the possible parameters is the {{spellcheck.dictionary}} to use, and multiples can
be defined. With multiple dictionaries, all specified dictionaries are consulted and results
are interleaved.  Collations are created with combinations from the different spellcheckers,
with care taken that multiple overlapping corrections do not occur in the same collation.

Here is an example with multiple dictionaries:

{code:borderStyle=solid|borderColor=#666666}
<requestHandler name="spellCheckWithWordbreak" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">wordbreak</str>
    <str name="spellcheck.count">20</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
{code}

h2. Spell Check Parameters

The SpellCheck component accepts the parameters described in the table below. For the queries
that Solr runs internally to test collations, any request parameter can be overridden by
specifying {{spellcheck.collateParam.xx}}, where _xx_ is the parameter you are overriding.

|| Parameter || Description ||
| [spellcheck|#The {{spellcheck}} Parameter] | Turns on or off SpellCheck suggestions for the request. If *true*, then spelling suggestions will be generated. |
| [spellcheck.q or q|#The {{spellcheck.q}} or {{q}} Parameter] | Selects the query to be spellchecked. |
| [spellcheck.build|#The {{spellcheck.build}} Parameter] | Instructs Solr to build a dictionary for use in spellchecking. |
| [spellcheck.collate|#The {{spellcheck.collate}} Parameter] | Causes Solr to build a new query based on the best suggestion for each term in the submitted query. |
| [spellcheck.maxCollations|#The {{spellcheck.maxCollations}} Parameter] | This parameter specifies the maximum number of collations to return. |
| [spellcheck.maxCollationTries|#The {{spellcheck.maxCollationTries}} Parameter] | This parameter specifies the number of collation possibilities for Solr to try before giving up. |
| [spellcheck.maxCollationEvaluations|#The {{spellcheck.maxCollationEvaluations}} Parameter] | This parameter specifies the maximum number of word correction combinations to rank and evaluate prior to deciding which collation candidates to test against the index. |
| [spellcheck.collateExtendedResult|#The {{spellcheck.collateExtendedResult}} Parameter] | If true, returns an expanded response detailing the collations found. If {{spellcheck.collate}} is false, this parameter will be ignored. |
| [spellcheck.collateMaxCollectDocs|#The {{spellcheck.collateMaxCollectDocs}} Parameter] | The maximum number of documents to collect when testing potential Collations |
| [spellcheck.count|#The {{spellcheck.count}} Parameter] | Specifies the maximum number of spelling suggestions to be returned. |
| [spellcheck.dictionary|#The {{spellcheck.dictionary}} Parameter] | Specifies the dictionary that should be used for spellchecking. |
| [spellcheck.extendedResults|#The {{spellcheck.extendedResults}} Parameter] | Causes Solr to return additional information about spellcheck results, such as the frequency of each original term in the index (origFreq) as well as the frequency of each suggestion in the index (frequency). Note that this result format differs from the non-extended one as the returned suggestion for a word is actually an array of lists, where each list holds the suggested term and its frequency. |
| [spellcheck.onlyMorePopular|#The {{spellcheck.onlyMorePopular}} Parameter] | Limits spellcheck responses to queries that are more popular than the original query. |
| [spellcheck.maxResultsForSuggest|#The {{spellcheck.maxResultsForSuggest}} Parameter] | The maximum number of hits the request can return in order to both generate spelling suggestions and set the "correctlySpelled" element to "false". |
| [spellcheck.alternativeTermCount|#The {{spellcheck.alternativeTermCount}} Parameter] | The count of suggestions to return for each query term existing in the index and/or dictionary. |
| [spellcheck.reload|#The {{spellcheck.reload}} Parameter] | Reloads the spellchecker. |
| [spellcheck.accuracy|#The {{spellcheck.accuracy}} Parameter] | Specifies an accuracy value to help decide whether a result is worthwhile. |
| [spellcheck.<DICT_NAME>.key|#The {{spellcheck.<DICT_NAME>.key}} Parameter] | Specifies a key/value pair for the implementation handling a given dictionary. |
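
To illustrate the {{spellcheck.collateParam.xx}} prefix mentioned before the table, the sketch
below tightens matching only for the internal queries that test collations. It assumes the
main query uses a parser that honors {{q.op}} and {{mm}}; the values shown are examples, not
recommendations.

{code:xml|borderStyle=solid|borderColor=#666666}
<lst name="defaults">
  <str name="spellcheck">true</str>
  <str name="spellcheck.collate">true</str>
  <str name="spellcheck.maxCollationTries">5</str>
  <!-- Applied only to the internal queries that test collations -->
  <str name="spellcheck.collateParam.q.op">AND</str>
  <str name="spellcheck.collateParam.mm">100%</str>
</lst>
{code}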

h3. The {{spellcheck}} Parameter

This parameter turns on SpellCheck suggestions for the request. If *true*, then spelling suggestions
will be generated.

h3. The {{spellcheck.q}} or {{q}} Parameter

This parameter specifies the query to spellcheck. If {{spellcheck.q}} is defined, then it
is used; otherwise the original input query is used. The {{spellcheck.q}} parameter is intended
to be the original query, minus any extra markup like field names, boosts, and so on. If the
{{q}} parameter is specified, then the {{SpellingQueryConverter}} class is used to parse it
into tokens; otherwise the [{{WhitespaceTokenizer}}|Tokenizers#White Space Tokenizer] is used.
The choice of which one to use is up to the application. Essentially, if you have a spelling
"ready" version in your application, then it is probably better to use {{spellcheck.q}}. Otherwise,
if you just want Solr to do the job, use the {{q}} parameter.

{note}
The {{SpellingQueryConverter}} class does not deal properly with non-ASCII characters. In this
case, you must either use {{spellcheck.q}} or implement your own QueryConverter.
{note}

h3. The {{spellcheck.build}} Parameter

If set to *true*, this parameter creates the dictionary that the SolrSpellChecker will use
for spell-checking. In a typical search application, you will need to build the dictionary
before using the SolrSpellChecker. However, it's not always necessary to build a dictionary
first. For example, you can configure the spellchecker to use a dictionary that already exists.

The dictionary will take some time to build, so this parameter should not be sent with every
request.

h3. The {{spellcheck.reload}} Parameter

If set to true, this parameter reloads the spellchecker. The results depend on the implementation
of {{SolrSpellChecker.reload()}}. In a typical implementation, reloading the spellchecker
means reloading the dictionary.

h3. The {{spellcheck.count}} Parameter

This parameter specifies the maximum number of suggestions that the spellchecker should return
for a term. If this parameter isn't set, the value defaults to 1. If the parameter is set
but not assigned a number, the value defaults to 5. If the parameter is set to a positive
integer, that number becomes the maximum number of suggestions returned by the spellchecker.

h3. The {{spellcheck.onlyMorePopular}} Parameter

If *true*, Solr will return only suggestions that result in more hits for the query than the
existing query. Note that this will return more popular suggestions even when the given query
term is present in the index and considered "correct".

h3. The {{spellcheck.maxResultsForSuggest}} Parameter

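This parameter specifies the maximum number of hits the request can return in order to both
generate spelling suggestions and set the "correctlySpelled" element to "false". For example,
if this is set to 5 and the user's query returns 5 or fewer results, the spellchecker
will report "correctlySpelled=false" and also offer suggestions (and collations if requested).
Setting this greater than zero is useful for creating "did-you-mean?" suggestions for queries
that return a low number of hits.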

h3. The {{spellcheck.alternativeTermCount}} Parameter

This parameter specifies the number of suggestions to return for each query term existing in
the index and/or dictionary. Presumably, users will want fewer suggestions for words with
docFrequency>0. Setting this value also turns on context-sensitive spell suggestions.
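
As a sketch of how the suggestion-count parameters might be combined in a request handler's
defaults, consider the fragment below. The numbers are illustrative assumptions, not
recommended values.

{code:xml|borderStyle=solid|borderColor=#666666}
<lst name="defaults">
  <str name="spellcheck">true</str>
  <!-- Maximum number of suggestions returned for a term -->
  <str name="spellcheck.count">5</str>
  <!-- Fewer suggestions for terms that already exist in the index or dictionary -->
  <str name="spellcheck.alternativeTermCount">2</str>
  <!-- Offer "did-you-mean?" suggestions when the query returns 5 or fewer hits -->
  <str name="spellcheck.maxResultsForSuggest">5</str>
</lst>
{code}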

h3. The {{spellcheck.extendedResults}} Parameter

This parameter causes Solr to include additional information about the suggestion, such
as the frequency in the index.

h3. The {{spellcheck.collate}} Parameter

If *true*, this parameter directs Solr to take the best suggestion for each token (if one
exists) and construct a new query from the suggestions. For example, if the input query was
"jawa class lording" and the best suggestion for "jawa" was "java" and "lording" was "loading",
then the resulting collation would be "java class loading".

The spellcheck.collate parameter only returns collations that are guaranteed to result in
hits if re-queried, even when applying original {{fq}} parameters. This is especially helpful
when there is more than one correction per query.

{note}
This only returns a query to be used. It does not actually run the suggested query.
{note}

h3. The {{spellcheck.maxCollations}} Parameter

The maximum number of collations to return.  The default is *1*. This parameter is ignored
if {{spellcheck.collate}} is false.

h3. The {{spellcheck.maxCollationTries}} Parameter

This parameter specifies the number of collation possibilities for Solr to try before giving
up. Lower values ensure better performance. Higher values may be necessary to find a collation
that can return results. The default value is {{0}}, which maintains backwards-compatible
(Solr 1.4) behavior (do not check collations). This parameter is ignored if {{spellcheck.collate}}
is false.

h3. The {{spellcheck.maxCollationEvaluations}} Parameter

This parameter specifies the maximum number of word correction combinations to rank and evaluate
prior to deciding which collation candidates to test against the index. This is a performance
safety-net in case a user enters a query with many misspelled words. The default is *10,000*
combinations, which should work well in most situations.

h3. The {{spellcheck.collateExtendedResult}} Parameter

If *true*, this parameter returns an expanded response format detailing the collations Solr
found. The default value is *false* and this is ignored if {{spellcheck.collate}} is false.


h3. The {{spellcheck.collateMaxCollectDocs}} Parameter

This parameter specifies the maximum number of documents that should be collected when testing
potential collations against the index.  A value of *0* indicates that all documents should
be collected, resulting in exact hit-counts.  Otherwise an estimation is provided as a performance
optimization in cases where exact hit-counts are unnecessary -- the higher the value specified,
the more precise the estimation.

The default value for this parameter is *0*, but when {{spellcheck.collateExtendedResults}}
is *false*, the optimization is always used as if a *1* had been specified.
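
The collation parameters described above could be combined in a request handler's defaults
roughly as follows. This is a sketch with illustrative values; in particular, the limit of
100 documents is an arbitrary assumption.

{code:xml|borderStyle=solid|borderColor=#666666}
<lst name="defaults">
  <str name="spellcheck">true</str>
  <str name="spellcheck.collate">true</str>
  <!-- Return up to 3 collations -->
  <str name="spellcheck.maxCollations">3</str>
  <!-- Test up to 10 candidate collations against the index -->
  <str name="spellcheck.maxCollationTries">10</str>
  <!-- Include the expanded collation details in the response -->
  <str name="spellcheck.collateExtendedResults">true</str>
  <!-- Collect at most 100 documents per tested collation, so hit counts are estimates -->
  <str name="spellcheck.collateMaxCollectDocs">100</str>
</lst>
{code}

Because {{spellcheck.maxCollationTries}} defaults to {{0}}, leaving it out would mean that
collations are not checked against the index at all.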

h3. The {{spellcheck.dictionary}} Parameter

This parameter causes Solr to use the dictionary named in the parameter's argument. The default
setting is "default". This parameter can be used to invoke a specific spellchecker on a per
request basis.

h3. The {{spellcheck.accuracy}} Parameter

Specifies an accuracy value to be used by the spell checking implementation to decide whether
a result is worthwhile or not. The value is a float between 0 and 1. Defaults to {{Float.MIN_VALUE}}.

h3. The {{spellcheck.<DICT_NAME>.key}} Parameter

Specifies a key/value pair for the implementation handling a given dictionary. The value that
is passed through is just {{key=value}} ({{spellcheck.<DICT_NAME>.}} is stripped off).

For example, given a dictionary called {{foo}}, {{spellcheck.foo.myKey=myValue}} would result
in {{myKey=myValue}} being passed through to the implementation handling the dictionary {{foo}}.

h3. Example

This example shows the results of a simple query that defines a query using the {{spellcheck.q}}
parameter. The query also includes a {{spellcheck.build=true}} parameter, which needs to
be called only once in order to build the index. {{spellcheck.build}} should not be specified
with each request.

{{[http://localhost:8983/solr/spellCheckCompRH?q=*:*&spellcheck.q=hell%20ultrashar&spellcheck=true&spellcheck.build=true]}}

Results:

{code:xml|borderStyle=solid|borderColor=#666666}
<lst name="spellcheck">
        <lst name="suggestions">
                <lst name="hell">
                        <int name="numFound">1</int>
                        <int name="startOffset">0</int>
                        <int name="endOffset">4</int>
                        <arr name="suggestion">
                                <str>dell</str>
                        </arr>
                </lst>
                <lst name="ultrashar">
                        <int name="numFound">1</int>
                        <int name="startOffset">5</int>
                        <int name="endOffset">14</int>
                        <arr name="suggestion">
                                <str>ultrasharp</str>
                        </arr>
                </lst>
        </lst>
</lst>
{code}

h2. Distributed SpellCheck

The {{SpellCheckComponent}} also supports spellchecking on distributed indexes. If you are
using the SpellCheckComponent on a request handler other than "/select", you must provide
the following two parameters:

|| Parameter || Description ||
| shards | Specifies the shards in your distributed indexing configuration. For more information about distributed indexing, see [Distributed Search with Index Sharding] |
| shards.qt | Specifies the request handler Solr uses for requests to shards. This parameter is not required for the {{/select}} request handler. |

For example: {{[http://localhost:8983/solr/select?q=*:*&spellcheck=true&spellcheck.build=true&spellcheck.q=toyata&qt=spell&shards.qt=spell&shards=solr-shard1:8983/solr,solr-shard2:8983/solr]}}

In a distributed request to the SpellCheckComponent, each shard is asked for at least five
suggestions even if the {{spellcheck.count}} parameter value is less than five.
Once the suggestions are collected, they are ranked by the configured distance measure (Levenshtein
distance by default) and then by aggregate frequency.
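
Rather than passing {{shards.qt}} on every request, one option is to set it in the defaults of
the spell check request handler. The sketch below assumes a handler named {{spell}}, matching
the example URL above; the {{shards}} list itself would still be supplied with each request.

{code:xml|borderStyle=solid|borderColor=#666666}
<requestHandler name="spell" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- Route the per-shard sub-requests to this same handler -->
    <str name="shards.qt">spell</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
{code}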

{scrollbar}

