lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Luke Nezda <lne...@gmail.com>
Subject Levenshtein FST's?
Date Tue, 24 May 2016 00:59:32 GMT
Hello, all -

I'd like to use Lucene's automaton/FST code to achieve fast fuzzy (OSA edit
distance up to 2) search for many (10k+) strings (knowledge base: kb) in
many large strings (docs).

Approach I was thinking of: create Levenshtein FST with all paths
associated with unedited form for each kb key, union all into single fst,
search docs for matches in fst in style of SynonymFilter.

* I created 10k Levenshtein automata from kb keys  and unioned them, so
that seems tractable (took 1 minute, ~250MB ram)
* SynonymFilter code worked fine to associate output and record match token
length.
* Saw how FuzzySuggester created Levenshtein automata from query/lookup key
and intersected that with kb-like fst.

I don't see how to create Levenshtein FSTs (vs automatons) associating
outputs with unedited form, and union'ing them together.

Is this a bad idea?  Maybe better idea?

Thanks in advance,
- Luke

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message