lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 19:52:18 GMT
bq. You create new Match objects and discard mismatched existing Matches

I didn't say that explicitly but obviously you don't need to create
new objects when you're doing this. The pool of match states can be
only as big as the longest pattern so you can pool them and reuse.
Zero allocation cost.

Dawid

On Thu, Jul 12, 2012 at 9:50 PM, Dawid Weiss
<dawid.weiss@cs.put.poznan.pl> wrote:
> bq. I rather like Wikipedia's definition:
> http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm
>
> I did a similar thing but:
>
> 1) based on entire words as individual "tokens" (instead of letter-by-letter),
> 2) all words present in input patterns can be encoded as a separate
> data structure which maps to an unique integer
> 3) the matcher is essentially tracking the following:
>
> Match {
>   automaton_arc toNextNode;
>   final int matchStart;
>   int matchLength;
> }
>
> you then process the input word-by-word and advance each Match if
> there is an arc leaving "toNextNode" and matching the current word. If
> the toNextNode arc is final then you've hit a match and need to record
> it (it may not be the longest match so if you only care about the
> longest matches then additional processing is required).
>
> You create new Match objects and discard mismatched existing Matches
> as you process the input. Essentially, it's as if you tried to walk
> down in the automaton starting on every single position in the input.
> This may seem costly but in reality the matches are infrequent
> compared to the input text and they are rarely very, very long (to
> create lots of states). I used the above approach for entity matching
> and it worked super-fast.
>
> All this said, an Aho-Corasick transition graph would of course be
> more efficient. The question is how much more efficient and how much
> code/ work you'll need to put into it to make it work :)
>
> Dawid

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message