lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: DirectSpellChecker.suggestSimilar() scans TermEnum. but why?
Date Sun, 30 Dec 2012 00:15:36 GMT
On Sat, Dec 29, 2012 at 9:58 AM, Mikhail Khludnev
<mkhludnev@griddynamics.com> wrote:
> Happy New Year, Devs!
>
> Excuse the newbie question. I haven't been able to get deep into the FST
> internals. I ran a trivial benchmark and was not happy with the results.
>
> I'm looking for ultra-fast spelling correction. Right now I use the 3.x
> SpellChecker, which is backed by a separate Lucene n-gram index. FWIW, it's
> persistent, not in a RAMDirectory. The bottleneck now is I/O: reading that
> n-gram index takes too much time. I guess it might be solved by loading the
> n-gram index into a RAMDirectory, but I want to use the FST-based spell
> checking from 4.0.
>
> Here is what I see, and what makes me wonder: every
> DirectSpellChecker.suggestSimilar() call creates a new FuzzyTermsEnum, and
> every time it scans the terms enum via FilteredTermsEnum.next(). Here I hit
> the same slow I/O. One possibly relevant detail: I'm reading a 3.x index
> with 4.0 code. I don't think that changes anything.

Actually, it does: when 4.x reads a 3.x index it has some non-trivial
code to handle the reordering of terms from UTF-16 sort order to Unicode
code point order. So before concluding anything from the results, you
should test on a new 4.0 index ...
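
To illustrate why that reordering is needed at all, here is a minimal,
self-contained sketch in plain Java (no Lucene classes involved). It shows
that UTF-16 code unit order and Unicode code point order disagree as soon as
a supplementary character meets a high BMP character. The class name
TermSortOrderDemo and the helper compareByCodePoint are hypothetical names
made up for this illustration; they are not part of Lucene.

```java
import java.util.Arrays;

final class TermSortOrderDemo {

    /** Compares two strings by Unicode code point rather than by UTF-16 code unit. */
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other, or they are equal.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String highBmp = "\uFF61";                                     // one UTF-16 code unit
        String supplementary = new String(Character.toChars(0x10400)); // surrogate pair D801 DC00

        // String.compareTo works on UTF-16 code units, so the surrogate pair sorts first.
        String[] utf16Order = {highBmp, supplementary};
        Arrays.sort(utf16Order);

        // Code point order puts U+FF61 before U+10400.
        String[] codePointOrder = {supplementary, highBmp};
        Arrays.sort(codePointOrder, TermSortOrderDemo::compareByCodePoint);

        System.out.println("UTF-16 order starts with the supplementary char: "
                + utf16Order[0].equals(supplementary));
        System.out.println("Code point order starts with the BMP char: "
                + codePointOrder[0].equals(highBmp));
    }
}
```

Terms that contain only characters below U+D800 sort identically under both
orders, so how much this matters for a given index depends on its contents.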

> I don't know anything about FSTs, but I thought an FST is a compact graph
> of syllables that is traversed to find strings similar to the given one,
> i.e. I expected it not to scan the terms enum on every lookup.

It would be possible to create an FST and do fuzzy lookup directly
from that ... to "approximate" that you could try using
MemoryPostingsFormat (stores all terms + docs in an FST). That should
avoid all I/O (assuming your OS never swaps out your process RAM ;) ),
but it will only be a (maybe sizable) lower bound on the perf you'd get
with a dedicated fuzzy search on an FST ...
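
As a rough illustration of what "fuzzy lookup directly on the structure"
means, the sketch below walks an in-memory trie and carries one row of the
Levenshtein table per node, pruning any branch that can no longer come within
the edit budget. It is only a sketch under stated assumptions: a plain trie
stands in for the FST, the class TrieFuzzyLookup and its methods are
hypothetical names, and none of this is Lucene's FST or FuzzyTermsEnum code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TrieFuzzyLookup {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean isTerm;
    }

    private final Node root = new Node();

    /** Adds a term to the in-memory trie. */
    void add(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    /** Returns all stored terms within maxEdits Levenshtein edits of query, in sorted order. */
    List<String> suggest(String query, int maxEdits) {
        List<String> results = new ArrayList<>();
        // Row for the empty prefix: distance to query[0..i) is i deletions.
        int[] firstRow = new int[query.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        walk(root, new StringBuilder(), firstRow, query, maxEdits, results);
        return results;
    }

    private static void walk(Node node, StringBuilder prefix, int[] prevRow,
                             String query, int maxEdits, List<String> results) {
        if (node.isTerm && prevRow[query.length()] <= maxEdits) {
            results.add(prefix.toString());
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            int[] row = new int[prevRow.length];
            row[0] = prevRow[0] + 1;
            int rowMin = row[0];
            for (int i = 1; i < row.length; i++) {
                int substitute = prevRow[i - 1] + (query.charAt(i - 1) == c ? 0 : 1);
                int skipQueryChar = row[i - 1] + 1;
                int skipTermChar = prevRow[i] + 1;
                row[i] = Math.min(substitute, Math.min(skipQueryChar, skipTermChar));
                rowMin = Math.min(rowMin, row[i]);
            }
            // Prune: if every cell exceeds the budget, no extension of this prefix can match.
            if (rowMin <= maxEdits) {
                prefix.append(c);
                walk(entry.getValue(), prefix, row, query, maxEdits, results);
                prefix.setLength(prefix.length() - 1);
            }
        }
    }

    public static void main(String[] args) {
        TrieFuzzyLookup lookup = new TrieFuzzyLookup();
        for (String term : new String[] {"lucene", "lucent", "lucid", "solr", "search"}) {
            lookup.add(term);
        }
        System.out.println(lookup.suggest("lucen", 1)); // prints [lucene, lucent]
    }
}
```

The point of the pruning step is that the lookup only visits prefixes that
can still lead to a match, instead of enumerating every term in the
dictionary.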

Mike McCandless

http://blog.mikemccandless.com


