lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
Date Tue, 05 Apr 2011 13:03:05 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015902#comment-13015902
] 

Dawid Weiss commented on SOLR-2378:
-----------------------------------

This is a git patch (trim the first level to apply without git, should work out of the box)
against the trunk that implements FST-based suggester. I've also added a more realistic benchmark
that shows performance limitations of the current suggester implementations, especially in
the "onlyMorePopular" mode. Some benchmarks from my machine:

Time to create internal data structures out of 50.000 suggestion terms (692kB of UTF8):
{noformat}
JaspellLookup   input: 50,000, time[ms]: 30 [+- 4.93]
TSTLookup       input: 50,000, time[ms]: 39 [+- 2.56]
FSTLookup       input: 50,000, time[ms]: 62 [+- 2.52]
{noformat}

Memory used:
{noformat}
JaspellLookup   size[B]:    7,868,863
TSTLookup       size[B]:    7,914,144
FSTLookup       size[B]:      300,148
{noformat}

Traversal speed (randomized prefixes of input sequences). We start with full matches:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: true
JaspellLookup   queries: 50000, time[ms]: 108 [+- 6.23], ~qps: 462
TSTLookup       queries: 50000, time[ms]: 54 [+- 1.54], ~qps: 922
FSTLookup       queries: 50000, time[ms]: 66 [+- 1.30], ~qps: 761
{noformat}

These results are overly optimistic because users rarely type in something in full; let's
cut the prefix length to 6-9 chars:
{noformat}
-- prefixes: 6-9, num: 7, onlyMorePopular: true
JaspellLookup   queries: 50000, time[ms]: 155 [+- 4.20], ~qps: 322
TSTLookup       queries: 50000, time[ms]: 208 [+- 3.99], ~qps: 241
FSTLookup       queries: 50000, time[ms]: 71 [+- 1.36], ~qps: 700
{noformat}

Clearly, the FST is more resilient to the length of the input prefix... let's cut it to 2-4
characters:
{noformat}
-- prefixes: 2-4, num: 7, onlyMorePopular: true
JaspellLookup   queries: 50000, time[ms]: 420 [+- 5.37], ~qps: 119
TSTLookup       queries: 50000, time[ms]: 1687 [+- 10.96], ~qps: 30
FSTLookup       queries: 50000, time[ms]: 90 [+- 2.27], ~qps: 554
{noformat}

I didn't have the time to look into it, but TST's result is surprisingly bad with such short
prefixes. FST keeps nearly the same perf. level.

In the current implementation I throw an exception if somebody tries to get suggestions from
FST without sorting by weights (this is equivalent to building the suggester with a single
weight for all terms). This could be implemented, but at a small performance penalty -- to
be considered if you think it is useful. The above performance problems for short prefixes
are interestingly present even with onlyMorePopular set to false:
{noformat}
-- prefixes: 100-200, num: 7, onlyMorePopular: false
JaspellLookup   queries: 50000, time[ms]: 94 [+- 6.45], ~qps: 532
TSTLookup       queries: 50000, time[ms]: 46 [+- 0.98], ~qps: 1092
-- prefixes: 6-9, num: 7, onlyMorePopular: false
JaspellLookup   queries: 50000, time[ms]: 123 [+- 3.12], ~qps: 405
TSTLookup       queries: 50000, time[ms]: 188 [+- 3.03], ~qps: 266
-- prefixes: 2-4, num: 7, onlyMorePopular: false
JaspellLookup   queries: 50000, time[ms]: 225 [+- 5.69], ~qps: 222
TSTLookup       queries: 50000, time[ms]: 1523 [+- 16.69], ~qps: 33
{noformat}

Peek at the code and let me know if you can think of any improvements/ modifications, especially
wrt Solr infrastructure (I specifically didn't implement any real solr instance test case).



> FST-based Lookup (suggestions) for prefix matches.
> --------------------------------------------------
>
>                 Key: SOLR-2378
>                 URL: https://issues.apache.org/jira/browse/SOLR-2378
>             Project: Solr
>          Issue Type: New Feature
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>              Labels: lookup, prefix
>             Fix For: 4.0
>
>         Attachments: SOLR-2378.patch
>
>
> Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST
package). This issue is for implementing a relatively basic prefix matcher, we will handle
infixes and other types of input matches gradually. Impl. phases:
> - -write a DFA based suggester effectively identical to ternary tree based solution right
now,-
> - -baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing
speed; reuse Andrzej's benchmark code)-
> - -modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular
case)-
> - -benchmark again-
> - add infix suggestion support with prefix matches boosted higher (?)
> - benchmark again
> - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message