lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "Suggester" by Juan Grande
Date Thu, 13 Jan 2011 21:20:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "Suggester" page has been changed by Juan Grande.
The comment on this change is: Replaced all occurrences of "location" by "sourceLocation".
Fixed "Search handler configuration" section's bulleting..
http://wiki.apache.org/solr/Suggester?action=diff&rev1=6&rev2=7

--------------------------------------------------

  = Suggester - a flexible "autocomplete" component. =
  A common need in search applications is suggesting query terms or phrases based on incomplete
user input. These completions may come from a dictionary based on the main index or on any
other arbitrary dictionary. It's often useful to provide only the top-N suggestions, either
ranked alphabetically or according to their usefulness for an average user (e.g. popularity,
or the number of returned results).
  
  Solr 3.x and 4.x include a component called Suggester that provides this functionality.
See the [[https://issues.apache.org/jira/browse/SOLR-1316|SOLR-1316]] JIRA issue for the original
motivations and patches.
  
  Suggester reuses much of the SpellCheckComponent infrastructure, so it also reuses many
common SpellCheck parameters, such as `spellcheck=true` or `spellcheck.build=true`. The way
this component is configured in `solrconfig.xml` is also very similar:

  {{{
    <searchComponent class="solr.SpellCheckComponent" name="suggest">
      <lst name="spellchecker">
      ...
      </arr>
    </requestHandler>
  }}}
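  For orientation, a complete configuration might look like the following sketch. The index
field name `text`, the parameter values, and the handler path are illustrative, and the fully
qualified `classname` is an assumption based on the Solr 3.x package layout - adjust to your
setup. The individual parameters are explained in the Configuration section below.

  {{{
    <searchComponent class="solr.SpellCheckComponent" name="suggest">
      <lst name="spellchecker">
        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.suggest.tst.TSTLookup</str>
        <!-- illustrative: build the dictionary from this index field -->
        <str name="field">text</str>
        <!-- illustrative: skip terms appearing in less than 0.5% of documents -->
        <float name="threshold">0.005</float>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>

    <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">suggest</str>
        <str name="spellcheck.onlyMorePopular">true</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
      </lst>
      <arr name="components">
        <str>suggest</str>
      </arr>
    </requestHandler>
  }}}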
  The look-up of matching suggestions in a dictionary is implemented by subclasses of the
Lookup class. Two implementations are included in Solr, both based on in-memory tries:
JaspellLookup and TSTLookup. Benchmarks indicate that TSTLookup provides better performance
at a lower memory cost (roughly 50% faster and 50% of the memory cost). However, JaspellLookup
can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's a
one-line change in JaspellLookup).
  
  An example of an autosuggest request:

  {{{
  http://localhost:8983/solr/suggest?q=ac
  }}}
  And the corresponding response:

  {{{
  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    ...
    </lst>
  </response>
  }}}
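  The elided part of the response carries the actual suggestions. With the handler defaults
shown later on this page, a response for `q=ac` might look roughly like this sketch (the
suggested terms and timings are illustrative):

  {{{
  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">2</int>
    </lst>
    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="ac">
          <int name="numFound">2</int>
          <int name="startOffset">0</int>
          <int name="endOffset">2</int>
          <arr name="suggestion">
            <str>accidentally</str>
            <str>accommodate</str>
          </arr>
        </lst>
        <str name="collation">accidentally</str>
      </lst>
    </lst>
  </response>
  }}}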
  = Configuration =
  The configuration snippet above shows a few common configuration parameters. Here's a complete
list of them:
  
  == SpellCheckComponent configuration ==
  * `searchComponent/@name` - an arbitrary name for this component

  * `spellchecker` list:
   * `name` - a symbolic name of this spellchecker (it can later be referred to in URL parameters
and in SearchHandler configuration - see the section below)
   * `classname` - Suggester, to provide the autocomplete functionality
   * `lookupImpl` - the Lookup implementation. Currently two in-memory implementations are available:
    * `org.apache.solr.suggest.tst.TSTLookup` - a simple, compact lookup based on a ternary trie
    * `org.apache.solr.suggest.jaspell.JaspellLookup` - a more complex lookup based on a ternary
trie from the [[http://jaspell.sourceforge.net/|JaSpell]] project
   * `buildOnCommit` - if set to true, the Lookup data structure will be rebuilt after each
commit. If false (the default), the Lookup data will be built only when requested (by the URL
parameter `spellcheck.build=true`). '''NOTE: currently implemented Lookups keep their data
in memory, so unlike spellchecker data this data is discarded on core reload and is not available
until you invoke the build command, either explicitly or implicitly via a commit.'''
   * `sourceLocation` - the location of the dictionary file (a file-based sketch follows this
list). If not empty, this is a path to a dictionary file (see the Dictionary section below).
If empty, the main index will be used as the source of terms and weights.
   * `field` - if `sourceLocation` is empty, terms from this field in the index will be used
when building the trie.
   * `threshold` - a value in [0..1] representing the minimum fraction of documents (of the
total) in which a term must appear in order to be added to the lookup dictionary.
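  For the file-based variant mentioned under `sourceLocation`, the same list might instead
look like this sketch (the file name is illustrative; the file is typically placed in the
core's conf directory):

  {{{
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.suggest.tst.TSTLookup</str>
      <!-- illustrative file name; see the Dictionary section below for the format -->
      <str name="sourceLocation">suggest-dictionary.txt</str>
    </lst>
  }}}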
  
  == Dictionary ==
  When a file-based dictionary is used (a non-empty `sourceLocation` parameter above), it is
expected to be a plain text file in UTF-8 encoding. Blank lines and lines that start with a
'#' character are ignored. Each remaining line must consist of either a string without a
literal TAB (\u0009) character, or a string followed by a TAB character and a floating-point
weight.
  
  Example:

  {{{
  # This is a sample dictionary file.
  
  ...
  accidentally\t2.0
  accommodate\t3.0
  }}}
  If the weight is missing, it is assumed to be 1.0 (the `\t` in the example above stands for
a literal TAB character in the actual file). Weights affect the sorting of matching suggestions
when `spellcheck.onlyMorePopular=true` is selected - weights are treated as a "popularity"
score, with higher-weighted suggestions preferred over those with lower weights.
  
  Please note that the format of the file is not limited to single terms - it can also contain
phrases, which is an improvement over the TermsComponent that could otherwise be used for a
simple version of autocomplete functionality.
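  For instance, a single dictionary line holding a weighted phrase might look like this (the
phrase is made up, and `\t` again denotes a real TAB character):

  {{{
  apache solr\t10.0
  }}}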
  
  === Threshold parameter ===
  As mentioned above, if the `sourceLocation` parameter is empty then the terms from the field
indicated by the `field` parameter are used. It's often the case that, due to imperfect source
data, there are many uncommon or invalid terms that occur only once in the whole corpus (e.g.
OCR errors, typos, etc). According to Zipf's law these actually form the majority of terms,
which means that a dictionary built indiscriminately from a real-life index would consist
mostly of uncommon terms, and its size would be enormous. In order to avoid this and to reduce
the size of in-memory structures, it's best to set the `threshold` parameter to a value slightly
above zero (0.5% in the example above). This already vastly reduces the size of the dictionary
by skipping [[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still
preserving most of the common terms. This parameter has no effect when using a file-based
dictionary - it's assumed that only useful terms are found there. ;)
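  To make the arithmetic concrete: with `threshold` set to 0.005 (0.5%) on an index of, say,
1,000,000 documents, only terms that appear in at least 0.005 * 1,000,000 = 5,000 documents
are added to the dictionary.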
  
  == SearchHandler configuration ==
  In the example above we add a new handler that uses SearchHandler with the single SearchComponent
that we just defined, namely the `suggest` component. Then we define a few defaults for this
component (which can be overridden with URL parameters):
  
  * `spellcheck=true` - because we always want to run the Suggester for queries submitted to
this handler.
  * `spellcheck.dictionary=suggest` - the name of the dictionary component that we configured
above.
  * `spellcheck.onlyMorePopular=true` - if this parameter is set to true then the suggestions
will be sorted by weight ("popularity"), and the `count` parameter will effectively limit this
to a top-N list of the best suggestions. If it is set to false then suggestions are sorted
alphabetically.
  * `spellcheck.count=5` - return up to 5 suggestions.
  * `spellcheck.collate=true` - provide a query collated with the first matching suggestion.
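  For example, two of these defaults could be overridden on a single request like this (an
illustrative URL, assuming the `/suggest` handler defined earlier):

  {{{
  http://localhost:8983/solr/suggest?q=ac&spellcheck.count=10&spellcheck.onlyMorePopular=false
  }}}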
  
  = Tips and tricks =
  * Use TSTLookup unless you need the more sophisticated matching of JaspellLookup. See the
[[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|benchmark
results]] - the source of this benchmark is in SuggesterTest.
  
  * Use the `threshold` parameter to limit the size of the trie, to reduce the build time, and
to remove invalid/uncommon terms. Values below 0.01 should be sufficient; greater values can
be used to limit the impact of terms that occur in a larger portion of documents. Values above
0.5 probably don't make much sense.
  
  * Don't forget to invoke `spellcheck.build=true` after a core reload. Alternatively, extend
the Lookup class to do this in init(), or implement the load/save methods in Lookup to persist
this data across core reloads.
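  For example, an explicit build right after a core reload could be triggered with a request
like this (an illustrative URL, assuming the `/suggest` handler defined earlier):

  {{{
  http://localhost:8983/solr/suggest?q=ac&spellcheck=true&spellcheck.build=true
  }}}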
  
  * If you want to use a dictionary file that contains phrases (more precisely, strings that
can be split into multiple tokens by the default QueryConverter), then define a different
QueryConverter like this:

  {{{
    <!--
    The SpellingQueryConverter to convert raw (CommonParams.Q) queries into tokens. Uses a
simple regular expression to strip off field markup, boosts, ranges, etc., but it is not
guaranteed to match an exact parse from the query parser.
    -->
    <queryConverter name="queryConverter" class="org.apache.solr.spelling.SuggestQueryConverter"/>
  }}}
