lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dawid Weiss (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2378) FST-based Lookup (suggestions) for prefix matches.
Date Mon, 04 Apr 2011 21:25:05 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015627#comment-13015627
] 

Dawid Weiss commented on SOLR-2378:
-----------------------------------

I'm done for the day. Lots of benchmarking and tweaking. Conclusions:
- Use raw utf16 values of input strings as they are ready to use in the input Strings and
don't need to be converted. The automaton is slightly bigger, but not that much (read: most
likely because English ascii chars fall into single byte vint range anyway).
- I created a non-recursive and recursive version of the main tail-state traversal loop, but
the gain is minimal.
- I have a relatively fast machine (i7, quad core) and parallel GC combined with tlabs makes
local allocations nearly zero-cost. Locally reused structures speed up by 5%, but increase
code complexity greatly.
- The benchmark currently in the codebase is very skewed. I tried on real life data (that
I cannot share, unfortunately) and the results are:
{noformat}
TSTLookup            size[B]=12,598,794
FSTLookup            size[B]=472,679
* Running 10 iterations for TSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
TSTLookup            buildTime[ms]=19 lookupTime[ms]=4,813
* Running 10 iterations for FSTLookup ...
  - warm-up 5 iterations...
  - main iterations:
FSTLookup            buildTime[ms]=68 lookupTime[ms]=263
{noformat}
It looks as if I've made a mistake, but not really. With the actual data the fan-out rate
is much higher than on long numbers. Also, real terms are shorter than longs and 75% of their
prefix is usually a 2-3 letter combination resulting in lots of matches that need to be passed
through the priority queue, etc. With the FST weights are part of the search structure (approximated
weights, to be exact) and so the cost of building it is  higher, but the gain at runtime is
substantial. The lookup time should be virtually constant (with very small deviation).
- JaspellLookup does not pass the tests if the input prefix has a space inside (?). TSTLookup
and FSTLookup work just fine.
- I compared against automata API in Morfologik; the speedup is about 30% (only ints are passed,
no Arc object decompression, etc.). But then -- the API of Lucene is more flexible.

I'll clean up the code and create a final patch for testing/ evaluation tomorrow.

> FST-based Lookup (suggestions) for prefix matches.
> --------------------------------------------------
>
>                 Key: SOLR-2378
>                 URL: https://issues.apache.org/jira/browse/SOLR-2378
>             Project: Solr
>          Issue Type: New Feature
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>              Labels: lookup, prefix
>             Fix For: 4.0
>
>
> Implement a subclass of Lookup based on finite state automata/ transducers (Lucene FST
package). This issue is for implementing a relatively basic prefix matcher, we will handle
infixes and other types of input matches gradually. Impl. phases:
> - write a DFA based suggester effectively identical to ternary tree based solution right
now,
> - baseline benchmark against tern. tree (memory consumption, rebuilding speed, indexing
speed; reuse Andrzej's benchmark code)
> - modify DFA to encode term weights directly in the automaton (optimize for onlyMostPopular
case)
> - benchmark again
> - add infix suggestion support with prefix matches boosted higher (?)
> - benchmark again
> - modify the tutorial on the wiki [http://wiki.apache.org/solr/Suggester]

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message