lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 19:50:06 GMT
bq. I rather like Wikipedia's definition:
http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm

I did a similar thing but:

1) based on entire words as individual "tokens" (instead of letter-by-letter),
2) all words present in input patterns can be encoded as a separate
data structure which maps to an unique integer
3) the matcher is essentially tracking the following:

Match {
  automaton_arc toNextNode;
  final int matchStart;
  int matchLength;
}

you then process the input word-by-word and advance each Match if
there is an arc leaving "toNextNode" and matching the current word. If
the toNextNode arc is final then you've hit a match and need to record
it (it may not be the longest match so if you only care about the
longest matches then additional processing is required).

You create new Match objects and discard mismatched existing Matches
as you process the input. Essentially, it's as if you tried to walk
down in the automaton starting on every single position in the input.
This may seem costly but in reality the matches are infrequent
compared to the input text and they are rarely very, very long (to
create lots of states). I used the above approach for entity matching
and it worked super-fast.

All this said, an Aho-Corasick transition graph would of course be
more efficient. The question is how much more efficient and how much
code/ work you'll need to put into it to make it work :)

Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message