lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "Suggester" by AndrzejBialecki
Date Mon, 27 Sep 2010 22:01:48 GMT
The "Suggester" page has been changed by AndrzejBialecki.
http://wiki.apache.org/solr/Suggester?action=diff&rev1=3&rev2=4

--------------------------------------------------

      <lst name="defaults">
        <str name="spellcheck">true</str>
        <str name="spellcheck.dictionary">suggest</str>
+       <str name="spellcheck.onlyMorePopular">true</str>
+       <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
      </lst>
      <arr name="components">
@@ -35, +37 @@

  
  The look-up of matching suggestions in a dictionary is implemented by subclasses of the
Lookup class. Two implementations are included in Solr, both based on
in-memory tries: JaspellLookup and TSTLookup. Benchmarks indicate that TSTLookup provides
better performance at a lower memory cost (roughly 50% faster, using about 50% of the memory). However,
JaspellLookup can provide "fuzzy" suggestions, though this functionality is not currently
exposed (it's a one-line change in JaspellLookup).
  
+ An example of an autosuggest request:
+ {{{
+ http://localhost:8983/solr/suggest?q=ac
+ }}}
+ 
+ And the corresponding response:
+ {{{
+ <?xml version="1.0" encoding="UTF-8"?>
+ <response>
+   <lst name="spellcheck">
+     <lst name="suggestions">
+       <lst name="ac">
+         <int name="numFound">2</int>
+         <int name="startOffset">0</int>
+         <int name="endOffset">2</int>
+         <arr name="suggestion">
+           <str>acquire</str>
+           <str>accommodate</str>
+         </arr>
+       </lst>
+       <str name="collation">acquire</str>
+     </lst>
+   </lst>
+ </response>
+ }}}
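
A client would typically pull the suggestion lists and the collation out of a response like the one above. The following is a minimal sketch using only the Python standard library; the helper name `extract_suggestions` is hypothetical, and the sample text is copied from the example response rather than fetched from a live server.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the example above (not fetched from a server).
SAMPLE_RESPONSE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="ac">
        <int name="numFound">2</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>acquire</str>
          <str>accommodate</str>
        </arr>
      </lst>
      <str name="collation">acquire</str>
    </lst>
  </lst>
</response>
"""


def extract_suggestions(xml_text):
    """Hypothetical helper: return ({token: [suggestions]}, collation or None)."""
    # Encode to bytes so the XML declaration's encoding is honoured by the parser.
    root = ET.fromstring(xml_text.encode("utf-8"))
    block = root.find("./lst[@name='spellcheck']/lst[@name='suggestions']")
    if block is None:
        return {}, None
    by_token = {}
    # Each nested <lst> is named after the query token it holds suggestions for.
    for entry in block.findall("./lst"):
        by_token[entry.get("name")] = [
            s.text for s in entry.findall("./arr[@name='suggestion']/str")
        ]
    collation = block.findtext("./str[@name='collation']")
    return by_token, collation


if __name__ == "__main__":
    print(extract_suggestions(SAMPLE_RESPONSE))
```

The sketch assumes the response layout shown above; a response without a `spellcheck` section yields an empty result instead of raising an error.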
+ 
  = Configuration =
  The configuration snippet above shows a few common configuration parameters. Here's a complete
list of them:
  
@@ -43, +71 @@

  * `searchComponent/@name` - arbitrary name for this component
  
  * `spellchecker` list:
-   * `name` - a symbolic name of this spellchecker (can be later referred to in URL parameters)
+   * `name` - a symbolic name of this spellchecker (can be later referred to in URL parameters
and in SearchHandler configuration - see the section below)
    * `classname` - Suggester, to provide the autocomplete functionality
    * `lookupImpl` - Lookup implementation. Currently two in-memory implementations are available:
      * `org.apache.solr.suggest.tst.TSTLookup` - a simple compact ternary trie based lookup
@@ -53, +81 @@

    * `field` - if `location` is empty then terms from this field in the index will be used
when building the trie.
    * `threshold` - threshold is a value in [0..1] representing the minimum fraction of documents
(of the total) where a term should appear, in order to be added to the lookup dictionary.
  
- == Dictionary file ==
+ == Dictionary ==
- It's a plain text file in UTF-8 encoding. Blank lines and lines that start with a '#' are
ignored. The remaining lines must consist of either a string without a literal TAB (\u0009)
character, or a string and a TAB-separated floating-point weight.
+ When a file-based dictionary is used (non-empty `location` parameter above), it's expected
to be a plain text file in UTF-8 encoding. Blank lines and lines that start with a '#' are
ignored. The remaining lines must consist of either a string without a literal TAB (\u0009)
character, or a string and a TAB-separated floating-point weight.
  
  Example:
  {{{
@@ -65, +93 @@

  accommodate\t3.0
  }}}
  
- If weight is missing it's assumed to be equal 1.0.
+ If the weight is missing it's assumed to be equal to 1.0. Weights affect the sorting of matching
suggestions when `spellcheck.onlyMorePopular=true` is selected - weights are treated as a "popularity"
score, and suggestions with higher weights are preferred over those with lower weights.
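
The file format and the effect of weights can be illustrated with a small sketch. It is not Solr's actual loader: the function names, the sample terms and the alphabetical tie-break between equal weights are assumptions made for illustration.

```python
def parse_dictionary(text):
    """Parse dictionary text into (term, weight) pairs.

    Blank lines and lines starting with '#' are skipped; a missing
    weight defaults to 1.0, as described above.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        term, _, weight = line.partition("\t")
        entries.append((term, float(weight) if weight.strip() else 1.0))
    return entries


def top_suggestions(entries, prefix, count):
    """Return up to `count` terms starting with `prefix`, heaviest first.

    Ties are broken alphabetically (an assumption, not taken from Solr).
    """
    matches = [(t, w) for t, w in entries if t.startswith(prefix)]
    matches.sort(key=lambda e: (-e[1], e[0]))
    return [t for t, _ in matches[:count]]


# Hypothetical sample data; "\t" is a real TAB character (\u0009).
SAMPLE = "# sample dictionary\n\nsolr\nsolar\t3.0\nsearch\t2.0\n"

if __name__ == "__main__":
    entries = parse_dictionary(SAMPLE)
    print(entries)
    print(top_suggestions(entries, "so", 5))
```

Because entries are plain strings split only on TAB, a phrase such as `apache solr` would be stored as a single entry in this sketch.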
  
  Please note that the format of the file is not limited to single terms but can also contain
phrases - which is an improvement over the TermsComponent that you could also use for a simple
version of autocomplete functionality.
  
+ === Threshold parameter ===
+ As mentioned above, if the `location` parameter is empty then the terms from the field indicated
by the `field` parameter are used. Due to imperfect source data there are often
many uncommon or invalid terms that occur only once in the whole corpus (e.g. OCR errors,
typos, etc). According to Zipf's law such terms actually form the majority of terms, which
means that a dictionary built indiscriminately from a real-life index would consist mostly
of uncommon terms, and its size would be enormous. To avoid this and to reduce the
size of in-memory structures it's best to set the `threshold` parameter to a value slightly
above zero (0.5% in the example above). This already vastly reduces the size of the dictionary
by skipping [[http://en.wikipedia.org/wiki/Hapax_legomenon|"hapax legomena"]] while still
preserving most of the common terms. This parameter has no effect when using a file-based
dictionary - it's assumed that only useful terms are found there. ;)
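
The effect of `threshold` can be sketched as a filter on document frequency. This is an illustration of the idea, not Solr's implementation: the function name and sample counts are hypothetical, and treating the boundary as inclusive is an assumption.

```python
def filter_by_threshold(doc_freqs, num_docs, threshold):
    """Keep terms that appear in at least `threshold` (a fraction in
    [0..1]) of all documents.

    doc_freqs maps each term to the number of documents containing it.
    The inclusive comparison (>=) is an assumption of this sketch.
    """
    min_docs = threshold * num_docs
    return {term: df for term, df in doc_freqs.items() if df >= min_docs}


if __name__ == "__main__":
    # Hypothetical corpus of 1000 documents; "slor" is a one-off typo.
    freqs = {"solr": 120, "lucene": 40, "slor": 1}
    print(filter_by_threshold(freqs, num_docs=1000, threshold=0.005))
```

With a threshold of 0.005 (0.5%) and 1000 documents, a term has to occur in at least 5 documents, so the single-occurrence typo is dropped while the common terms survive.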
+ 
+ == SearchHandler configuration ==
+ In the example above we add a new handler that uses SearchHandler with a single SearchComponent
that we just defined, namely the `suggest` component. Then we define a few defaults for this
component (that can be overridden with URL parameters):
+ 
+ * `spellcheck=true` - because we always want to run the Suggester for queries submitted
to this handler.
+ * `spellcheck.dictionary=suggest` - this is the name of the dictionary component that we
configured above.
+ * `spellcheck.onlyMorePopular=true` - if this parameter is set to true then the suggestions
will be sorted by weight ("popularity") - the `spellcheck.count` parameter will effectively limit this
to a top-N list of best suggestions. If this is set to false then suggestions are sorted alphabetically.
+ * `spellcheck.count=5` - return at most 5 suggestions.
+ * `spellcheck.collate=true` - to provide a query collated with the first matching suggestion.
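
Since these are only defaults, a client can override any of them per request. Below is a minimal sketch of building such a request URL with the Python standard library; the helper name `suggest_url` is hypothetical, and the base URL is the one from the example request earlier on this page.

```python
from urllib.parse import urlencode

# Handler URL taken from the example request earlier on this page.
BASE_URL = "http://localhost:8983/solr/suggest"


def suggest_url(prefix, overrides=None):
    """Build a suggest request URL; `overrides` replaces handler defaults."""
    params = {"q": prefix}
    params.update(overrides or {})
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Uses the handler defaults as configured.
    print(suggest_url("ac"))
    # Asks for up to 10 suggestions in alphabetical order instead.
    print(suggest_url("ac", {"spellcheck.count": 10,
                             "spellcheck.onlyMorePopular": "false"}))
```

`urlencode` also takes care of escaping, so a prefix containing spaces or other reserved characters is encoded correctly.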
+ 
+ = Tips and tricks =
+ 
+ * Use TSTLookup unless you need the more sophisticated matching of JaspellLookup. See [[https://issues.apache.org/jira/browse/SOLR-1316?focusedCommentId=12873599&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12873599|benchmark results]] - the source of this benchmark is in SuggesterTest.
+ 
+ * Use `threshold` parameter to limit the size of the trie, to reduce the build time and
to remove invalid/uncommon terms.
+ 
+ * Don't forget to send a request with `spellcheck.build=true` after a core reload. Alternatively, extend the Lookup
class to do this on init(), or implement the load/save methods in Lookup to persist this data
across core reloads.
+ 
+ * If you want to use a dictionary file that contains phrases (actually, strings that can
be split into multiple tokens by the default QueryConverter) then define a different QueryConverter
like this:
+ {{{
+   <!--
+   The SpellingQueryConverter to convert raw (CommonParams.Q) queries into tokens.  Uses a simple regular expression
+    to strip off field markup, boosts, ranges, etc. but it is not guaranteed to match an exact parse from the query parser.
+    -->
+   <queryConverter name="queryConverter" class="org.apache.solr.spelling.MySpellingQueryConverter"/>
+ }}}
+ 
