lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Smiley, David W." <dsmi...@mitre.org>
Subject Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 20:07:01 GMT
Thanks for your explanation, I already had a very rough idea of the approach.  Can Aho-Corasick
be implemented with Lucene's FST?  Again, the SynonymFilterFactory said this RE Aho-Corasick:
// This really amounts to adding a .*
// closure to the FST and then determinizing it.

You didn't mention FST once and that's the API I'm having trouble groking.

~ David

On Jul 12, 2012, at 3:51 PM, Dawid Weiss [via Lucene] wrote:

bq. I rather like Wikipedia's definition:
http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm<http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm>

I did a similar thing but:

1) based on entire words as individual "tokens" (instead of letter-by-letter),
2) all words present in input patterns can be encoded as a separate
data structure which maps to an unique integer
3) the matcher is essentially tracking the following:

Match {
  automaton_arc toNextNode;
  final int matchStart;
  int matchLength;
}

you then process the input word-by-word and advance each Match if
there is an arc leaving "toNextNode" and matching the current word. If
the toNextNode arc is final then you've hit a match and need to record
it (it may not be the longest match so if you only care about the
longest matches then additional processing is required).

You create new Match objects and discard mismatched existing Matches
as you process the input. Essentially, it's as if you tried to walk
down in the automaton starting on every single position in the input.
This may seem costly but in reality the matches are infrequent
compared to the input text and they are rarely very, very long (to
create lots of states). I used the above approach for entity matching
and it worked super-fast.

All this said, an Aho-Corasick transition graph would of course be
more efficient. The question is how much more efficient and how much
code/ work you'll need to put into it to make it work :)

Dawid


Mime
View raw message