lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "Suggester" by RobertMuir
Date Sun, 19 Feb 2012 17:31:32 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "Suggester" page has been changed by RobertMuir:
http://wiki.apache.org/solr/Suggester?action=diff&rev1=10&rev2=11

Comment:
add docs for wFST impl

        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
        <!-- Alternatives to lookupImpl: 
             org.apache.solr.spelling.suggest.fst.FSTLookup   [finite state automaton]
+            org.apache.solr.spelling.suggest.fst.WFSTLookupFactory [weighted finite state
automaton]
             org.apache.solr.spelling.suggest.jaspell.JaspellLookup [default, jaspell-based]
             org.apache.solr.spelling.suggest.tst.TSTLookup   [ternary trees]
        -->
@@ -45, +46 @@

   * JaspellLookup - tree-based representation based on Jaspell,
   * TSTLookup - ternary tree based representation, capable of immediate data structure updates,
   * FSTLookup - automaton based representation; slower to build, but consumes far less memory
at runtime (see performance notes below).
+  * WFSTLookup - weighted automaton representation: an alternative to FSTLookup for more
fine-grained ranking. Solr 3.6+
  
  For practical purposes all of the above implementations will most likely run at similar
speed when requests are made via the HTTP stack (which will
- become the bottleneck). Direct benchmarks of these classes indicate that FSTLookup provides
better performance compared to the other two methods, at a much lower memory cost. JaspellLookup
can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's
a one line change in JaspellLookup). Support for infix-suggestions is planned for FSTLookup
(which would be the only structure to support these).
+ become the bottleneck). Direct benchmarks of these classes indicate that (W)FSTLookup provides
better performance compared to the other two methods, at a much lower memory cost. JaspellLookup
can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's
a one line change in JaspellLookup). Support for infix-suggestions is planned for FSTLookup
(which would be the only structure to support these).
  
  An example of an autosuggest request:
  
@@ -89, +91 @@

    * `org.apache.solr.suggest.tst.TSTLookup` - a simple compact ternary trie based lookup
    * `org.apache.solr.suggest.jaspell.JaspellLookup` - a more complex lookup based on a ternary
trie from the [[http://jaspell.sourceforge.net/|JaSpell]] project.
    * `org.apache.solr.suggest.fst.FSTLookup` - automaton-based lookup
+   * `org.apache.solr.spelling.suggest.fst.WFSTLookupFactory` - weighted automaton-based
lookup
   * `buildOnCommit` - if set to true then the Lookup data structure will be rebuilt after
commit. If false (default) then the Lookup data will be built only when requested (by URL
parameter `spellcheck.build=true`). '''NOTE: currently implemented Lookup-s keep their data
in memory, so unlike spellchecker data this data is discarded on core reload and not available
until you invoke the build command, either explicitly or implicitly via commit.'''
   * `sourceLocation` - location of the dictionary file. If not empty then this is a path
to a dictionary file (see below). If this value is empty then the main index will be used
as a source of terms and weights.
   * `field` - if `sourceLocation` is empty then terms from this field in the index will be
used when building the trie.
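
For illustration only (not part of the wiki diff): the explicit build described under `buildOnCommit` is triggered with the URL parameter `spellcheck.build=true`. The sketch below assembles such a request URL. The host, port, and `/suggest` handler path are assumptions, as is the idea that the handler already enables the spellcheck component in its defaults.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and handler name are assumptions,
# not values taken from the diff above.
BASE_URL = "http://localhost:8983/solr/suggest"


def suggest_url(prefix, build=False, base_url=BASE_URL):
    """Return a suggest request URL for the given prefix.

    With build=True the URL also carries spellcheck.build=true, which asks
    Solr to (re)build the in-memory lookup before answering.
    """
    params = {"q": prefix}
    if build:
        params["spellcheck.build"] = "true"
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # First request after a core reload: force a build.
    print(suggest_url("ac", build=True))
    # Later requests: plain lookups.
    print(suggest_url("ac"))
```

Because the lookup data lives only in memory, a request like the first one is typically needed once after each core reload unless `buildOnCommit` is set to true.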
@@ -110, +113 @@

  
  Please note that the format of the file is not limited to single terms but can also contain
phrases - which is an improvement over the TermsComponent that you could also use for a simple
version of autocomplete functionality. 
  
- FSTLookup has a built-in mechism to discetize weights into a fixed set of buckets (to speed
up suggestions). The number of buckets is configurable.
+ FSTLookup has a built-in mechanism to discretize weights into a fixed set of buckets (to
speed up suggestions). The number of buckets is configurable.
+ 
+ WFSTLookup does not use buckets, but instead a shortest path algorithm. Note that it expects
weights to be whole numbers.
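
For illustration only (not part of the wiki diff): a minimal sketch of what discretizing weights into a fixed number of buckets can look like, next to the whole-number rounding that WFSTLookup expects. The equal-width scheme and the function names `bucketize` and `to_whole_numbers` are assumptions made for this example; the actual bucketing inside FSTLookup may differ.

```python
def bucketize(weights, num_buckets=10):
    """Map each weight to a bucket index in [0, num_buckets - 1].

    Equal-width bucketing between the minimum and maximum weight. This is an
    illustrative scheme, not necessarily the one FSTLookup implements.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights are identical, so everything lands in one bucket.
        return [0] * len(weights)
    span = hi - lo
    return [
        # The maximum weight would map to num_buckets, so clamp it.
        min(num_buckets - 1, int((w - lo) / span * num_buckets))
        for w in weights
    ]


def to_whole_numbers(weights):
    """Round fractional weights to integers, as a whole-number lookup expects."""
    return [int(round(w)) for w in weights]


if __name__ == "__main__":
    weights = [1.0, 5.0, 10.0]
    print(bucketize(weights, num_buckets=3))
    print(to_whole_numbers([1.4, 2.6]))
```

Terms that fall into the same bucket can no longer be ordered by their original weights, which is the ranking detail a bucket-free approach preserves.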
  
  === Threshold parameter ===
  As mentioned above, if the `sourceLocation` parameter is empty then the terms from a field
indicated by the `field` parameter are used. It's often the case that due to imperfect source
data there are many uncommon or invalid terms that occur only once in the whole corpus (e.g.
OCR errors, typos, etc). According to the Zipf's law this actually forms the majority of terms,
which means that the dictionary built indiscriminately from a real-life index would consist
mostly of uncommon terms, and its size would be enormous. In order to avoid this and to reduce
the size of in-memory structures it's best to set the `threshold` parameter to a value slightly
above zero (0.5% in the example above). This already vastly reduces the size of the dictionary
by skipping [[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still
preserving most of the common terms. This parameter has no effect when using a file-based
dictionary - it's assumed that only useful terms are found there. ;)
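
For illustration only (not part of the wiki diff): a minimal sketch of the effect of `threshold`, assuming it is interpreted as the minimum fraction of documents in which a term must occur. The function name, the sample document frequencies, and the exact boundary handling are assumptions, not Solr's implementation.

```python
def filter_by_threshold(doc_freqs, num_docs, threshold=0.005):
    """Keep only terms that occur in at least threshold * num_docs documents.

    doc_freqs maps each term to the number of documents containing it.
    """
    min_docs = threshold * num_docs
    return {term: df for term, df in doc_freqs.items() if df >= min_docs}


if __name__ == "__main__":
    # Made-up document frequencies for a 1000-document index; the two
    # misspellings each occur once (hapax legomena).
    doc_freqs = {"solr": 120, "lucene": 40, "slor": 1, "lucnee": 1}
    print(filter_by_threshold(doc_freqs, num_docs=1000, threshold=0.005))
```

With these sample numbers, a threshold of 0.5% requires a term to appear in at least 5 documents, so the one-off misspellings are dropped while the common terms survive.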
@@ -125, +130 @@

   * `spellcheck.collate=true` - to provide a query collated with the first matching suggestion.
  
  = Tips and tricks =
- * Use FSTLookup to conserve memory (unless you need a more sophisticated matching, then
look at JaspellLookup). There are some benchmarks of all three implementations: [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|SOLR-1316]]
(outdated) and a bit newer here:
+ * Use (W)FSTLookup to conserve memory (unless you need a more sophisticated matching, then
look at JaspellLookup). There are some benchmarks of all four implementations: [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12873599|SOLR-1316]]
(outdated) and a bit newer here:
- [[https://issues.apache.org/jira/browse/SOLR-2378|SOLR-2378]]. The class to perform these
benchmarks is in the source tree and is called LookupBenchmarkTest.
+ [[https://issues.apache.org/jira/browse/SOLR-2378|SOLR-2378]], and here: [[https://issues.apache.org/jira/browse/LUCENE-3714|LUCENE-3714]].

+ The class to perform these benchmarks is in the source tree and is called LookupBenchmarkTest.
  
  * Use `threshold` parameter to limit the size of the trie, to reduce the build time and
to remove invalid/uncommon terms. Values below 0.01 should be sufficient, greater values can
be used to limit the impact of terms that occur in a larger portion of documents. Values above
0.5 probably don't make much sense.
  
