lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: LevenshteinAutomata challenge
Date Wed, 10 Aug 2011 11:07:27 GMT
On Wed, Aug 10, 2011 at 5:07 AM, Dawid Weiss
<dawid.weiss@cs.put.poznan.pl> wrote:

> The thing you're describing is a regular composition of automata (as it
> exists, for example, when composing clauses of a regular expression). If I
> recall right the Levenshtein automaton in Lucene is built on modified brics
> code... if so then this should be not a problem.

Right, it's easy to build up the automaton for this example.  Then,
you need to do something similar to what DirectSpellChecker is doing,
but instead of FuzzyTermsEnum you'd create AutomatonTermsEnum
(Terms.intersect once LUCENE-3030 is committed), visit all terms,
collecting them in the PQ.  Maybe we should modify DirectSpellChecker
to optionally take an arbitrary automaton (expert ctor)?  Or maybe
factor out a static sugar method to "collect top N terms for this
automaton according to some scoring function", or something...

> The problem may be that
> currently automatons are used in enums in a way that skips from one accepted
> sequence to another accepted sequence (if possible). If the automaton has *
> operators then there is no way to establish these and everything falls back
> to full matching strategy.

Actually, infinite automata (ie, has cycles) are fine --
AutomatonTermsEnum attempts to .next() through such "tight" ranges of
terms, where the automaton is likely to accept more terms than the
terms dict; it uses .seekCeil for the vice/versa case (finite parts of
the term space).  LUCENE-3030 makes this more efficient for the block
tree terms dict.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message