lucene-solr-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "Suggester" by DawidWeiss
Date Thu, 14 Apr 2011 11:11:01 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "Suggester" page has been changed by DawidWeiss.
http://wiki.apache.org/solr/Suggester?action=diff&rev1=9&rev2=10

--------------------------------------------------

        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
+       <!-- Alternatives to lookupImpl: 
+            org.apache.solr.spelling.suggest.fst.FSTLookup   [finite state automaton]
+            org.apache.solr.spelling.suggest.jaspell.JaspellLookup [default, jaspell-based]
+            org.apache.solr.spelling.suggest.tst.TSTLookup   [ternary trees]
+       -->
        <str name="field">name</str>  <!-- the indexed field to derive suggestions
from -->
        <float name="threshold">0.005</float>
        <str name="buildOnCommit">true</str>
@@ -36, +41 @@

      </arr>
    </requestHandler>
  }}}
- The look-up of matching suggestions in a dictionary is implemented by subclasses of the
Lookup class - there are two implementations that are included in Solr, both are based on
in-memory tries: JaspellLookup and TSTLookup. Benchmarks indicate that TSTLookup provides
better performance at a lower memory cost (roughly 50% faster and 50% of memory cost) - however,
JaspellLookup can provide "fuzzy" suggestions, though this functionality is not currently
exposed (it's a one line change in JaspellLookup).
+ The look-up of matching suggestions in a dictionary is implemented by subclasses of the
Lookup class - there are three implementations that are included in Solr:
+  * JaspellLookup - tree-based representation based on Jaspell,
+  * TSTLookup - ternary tree based representation, capable of immediate data structure updates,
+  * FSTLookup - automaton based representation; slower to build, but consumes far less memory
at runtime (see performance notes below).
+ 
+ For practical purposes all of the above implementations will most likely run at similar
speed when requests are made via the HTTP stack (which will
+ become the bottleneck). Direct benchmarks of these classes indicate that FSTLookup provides
better performance compared to the other two methods, at a much lower memory cost. JaspellLookup
can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's
a one line change in JaspellLookup). Support for infix-suggestions is planned for FSTLookup
(which would be the only structure to support these).
  
  An example of an autosuggest request:
  
@@ -65, +76 @@

  </response>
  }}}
  = Configuration =
- The configuration snippet above shows a few common configuration parameters. Here's a complete
list of them:
+ The configuration snippet above shows a few common configuration parameters. A complete
list of them is best found int he source code of each Lookup class, but here is an overview:
  
  == SpellCheckComponent configuration ==
  * `searchComponent/@name` - arbitrary name for this component
@@ -77, +88 @@

   * `lookupImpl` - Lookup implementation. Currently two in-memory implementations are available:
    * `org.apache.solr.suggest.tst.TSTLookup` - a simple compact ternary trie based lookup
    * `org.apache.solr.suggest.jaspell.JaspellLookup` - a more complex lookup based on a ternary
trie from the [[http://jaspell.sourceforge.net/|JaSpell]] project.
+   * `org.apache.solr.suggest.fst.FSTLookup` - automaton-based lookup
   * `buildOnCommit` - if set to true then the Lookup data structure will be rebuilt after
commit. If false (default) then the Lookup data will be built only when requested (by URL
parameter `spellcheck.build=true`). '''NOTE: currently implemented Lookup-s keep their data
in memory, so unlike spellchecker data this data is discarded on core reload and not available
until you invoke the build command, either explicitly or implicitly via commit.'''
   * `sourceLocation` - location of the dictionary file. If not empty then this is a path
to a dictionary file (see below). If this value is empty then the main index will be used
as a source of terms and weights.
   * `field` - if `sourceLocation` is empty then terms from this field in the index will be
used when building the trie.
@@ -96, +108 @@

  }}}
  If weight is missing it's assumed to be equal 1.0. Weights affect the sorting of matching
suggestions when `spellcheck.onlyMorePopular=true` is selected - weights are treated as "popularity"
score, with higher weights preferred over suggestions with lower weights.
  
- Please note that the format of the file is not limited to single terms but can also contain
phrases - which is an improvement over the TermsComponent that you could also use for a simple
version of autocomplete functionality.
+ Please note that the format of the file is not limited to single terms but can also contain
phrases - which is an improvement over the TermsComponent that you could also use for a simple
version of autocomplete functionality. 
+ 
+ FSTLookup has a built-in mechism to discetize weights into a fixed set of buckets (to speed
up suggestions). The number of buckets is configurable.
  
  === Threshold parameter ===
  As mentioned above, if the `sourceLocation` parameter is empty then the terms from a field
indicated by the `field` parameter are used. It's often the case that due to imperfect source
data there are many uncommon or invalid terms that occur only once in the whole corpus (e.g.
OCR errors, typos, etc). According to the Zipf's law this actually forms the majority of terms,
which means that the dictionary built indiscriminately from a real-life index would consist
mostly of uncommon terms, and its size would be enormous. In order to avoid this and to reduce
the size of in-memory structures it's best to set the `threshold` parameter to a value slightly
above zero (0.5% in the example above). This already vastly reduces the size of the dictionary
by skipping [[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still
preserving most of the common terms. This parameter has no effect when using a file-based
dictionary - it's assumed that only useful terms are found there. ;)
@@ -111, +125 @@

   * `spellcheck.collate=true` - to provide a query collated with the first matching suggestion.
  
  = Tips and tricks =
- * Use TSTLookup unless you need a more sophisticated matching from JaspellLookup. See [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|benchmark
results]] - the source of this benchmark is in SuggesterTest.
+ * Use FSTLookup to conserve memory (unless you need a more sophisticated matching, then
look at JaspellLookup). There are some benchmarks of all three implementations: [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|SOLR-1316]]
(outdated) and a bit newer here:
+ [[https://issues.apache.org/jira/browse/SOLR-2378|SOLR-2378]]. The class to perform these
benchmarks is in the source tree and is called LookupBenchmarkTest.
  
  * Use `threshold` parameter to limit the size of the trie, to reduce the build time and
to remove invalid/uncommon terms. Values below 0.01 should be sufficient, greater values can
be used to limit the impact of terms that occur in a larger portion of documents. Values above
0.5 probably don't make much sense.
  

Mime
View raw message