lucene-dev mailing list archives

From: Michael McCandless <>
Subject: Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date: Thu, 12 Jul 2012 22:09:47 GMT
Some responses below:

Mike McCandless

On Thu, Jul 12, 2012 at 11:08 AM, Smiley, David W. <> wrote:
> Hello.
>   I'm embarking on developing code similar to the SynonymFilter, but which merely
> needs to record, out of band from the analysis, where the input tokens match the
> corpus in the FST.  I'm calling this a "keyword tagger": I shove text through it,
> and when it's done it tells me at what offsets there is a match to a corpus of
> keywords & phrases, and which keywords/phrases matched exactly.  It doesn't have to
> inject or modify the token stream because the results of this are going elsewhere.
> Although it would be a fine approach to simply emit the "tags", as I call them, as
> a way of consuming the results, I'm not indexing them, so it doesn't matter.
>   I noticed the following TODOs at the start:
> // TODO: maybe we should resolve token -> wordID then run
> // FST on wordIDs, for better perf?
> I intend to do this since my matching keywords/phrases are often more than one word,
> and I expect this will save memory and be faster.

Be sure to test that this is really faster: you'll need to add a step to
resolve word -> id (eg via a hashmap), which may net/net add cost, because
the FST can incrementally (quickly) determine that no word exists with a
given prefix.  The FST can also do better sharing (less RAM) of common
prefixes/suffixes.
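
If you do go the wordID route, here is a rough sketch of the tradeoff I mean
(a plain HashMap stands in for the real dictionary machinery; none of these
class/method names exist in Lucene):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the "resolve token -> wordID first" idea.  This is not the
// real Lucene FST/dictionary machinery; it just shows where the extra
// per-token hash lookup goes, and what you get back for it.
class WordIdDict {

  private final Map<String,Integer> wordToId = new HashMap<String,Integer>();

  // Build time: assign an id the first time a word is seen.
  int addOrGet(String word) {
    Integer id = wordToId.get(word);
    if (id == null) {
      id = wordToId.size();
      wordToId.put(word, id);
    }
    return id;
  }

  // Match time: every input token pays this lookup, even tokens that could
  // never start a phrase; that's the added cost.  The win is that phrases
  // become short int[] keys, which are cheaper to compare and share better
  // than full char sequences.
  int lookupOrMinusOne(String word) {
    Integer id = wordToId.get(word);
    return id == null ? -1 : id;
  }

  int[] phraseToIds(List<String> phraseTokens) {
    int[] ids = new int[phraseTokens.size()];
    for (int i = 0; i < phraseTokens.size(); i++) {
      ids[i] = addOrGet(phraseTokens.get(i));
    }
    return ids;
  }
}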

> // TODO: a more efficient approach would be Aho/Corasick's
> // algorithm
> //
> // It improves over the current approach here
> // because it does not fully re-start matching at every
> // token.  For example if one pattern is "a b c x"
> // and another is "b c d" and the input is "a b c d", on
> // trying to parse "a b c x" but failing when you got to x,
> // rather than starting over again you really should
> // immediately recognize that "b c d" matches at the next
> // input.  I suspect this won't matter that much in
> // practice, but it's possible on some set of synonyms it
> // will.  We'd have to modify Aho/Corasick to enforce our
> // conflict resolving (eg greedy matching) because that algo
> // finds all matches.  This really amounts to adding a .*
> // closure to the FST and then determinizing it.
> Could someone please clarify how the problem in the example above is to be fixed?
> At the end it states how to solve it, but I don't know how to do that, and I'm not
> sure whether there is more to it; after all, if it were as easy as that last
> sentence makes it sound, it would have been done already ;-)

The FSTs we create are not "malleable" so implementing what that crazy
comment says would not be easy.

However, there is a cool paper that Robert found:

I think it does not require heavily modifying the minimal FST (just
augmenting it w/ additional arcs that you follow on failure to match).
I think it's basically Aho-Corasick, done as an FST (which you can then,
eg, compose with other FSTs to compile a chain of replacements into a
single FST ... at least this was my quick understanding).
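
To make that concrete: the failure-arc idea is in the same family as classic
Aho-Corasick.  A toy version over whole tokens, not Lucene's FST classes (all
names here are invented, and it reports every match instead of doing our
greedy conflict resolution), looks roughly like this:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy Aho-Corasick over whole tokens (each word is one "symbol"), to show
// what the failure arcs buy you.
class TokenAhoCorasick {

  static class Node {
    final Map<String,Node> next = new HashMap<String,Node>();
    Node fail;  // longest proper suffix of the current path that is also a path from the root
    final List<String> hits = new ArrayList<String>();  // patterns ending at this state
  }

  private final Node root = new Node();

  void addPattern(String... tokens) {
    Node n = root;
    for (String t : tokens) {
      Node child = n.next.get(t);
      if (child == null) {
        child = new Node();
        n.next.put(t, child);
      }
      n = child;
    }
    n.hits.add(String.join(" ", tokens));
  }

  // Fill in the failure links with a BFS; call once after adding patterns.
  void build() {
    Deque<Node> queue = new ArrayDeque<Node>();
    for (Node child : root.next.values()) {
      child.fail = root;
      queue.add(child);
    }
    while (!queue.isEmpty()) {
      Node n = queue.remove();
      for (Map.Entry<String,Node> e : n.next.entrySet()) {
        String label = e.getKey();
        Node child = e.getValue();
        Node f = n.fail;
        while (f != root && !f.next.containsKey(label)) {
          f = f.fail;
        }
        child.fail = f.next.containsKey(label) ? f.next.get(label) : root;
        child.hits.addAll(child.fail.hits);  // also report patterns ending at the suffix state
        queue.add(child);
      }
    }
  }

  // Scan the input once; on a mismatch we follow failure links instead of
  // re-reading earlier tokens, which is the whole point of the algorithm.
  void scan(String[] input) {
    Node state = root;
    for (int i = 0; i < input.length; i++) {
      while (state != root && !state.next.containsKey(input[i])) {
        state = state.fail;
      }
      state = state.next.containsKey(input[i]) ? state.next.get(input[i]) : root;
      for (String hit : state.hits) {
        System.out.println("matched \"" + hit + "\" ending at token " + i);
      }
    }
  }

  public static void main(String[] args) {
    TokenAhoCorasick ac = new TokenAhoCorasick();
    ac.addPattern("a", "b", "c", "x");
    ac.addPattern("b", "c", "d");
    ac.build();
    // The example from the TODO: we find "b c d" without restarting at "b".
    ac.scan(new String[] {"a", "b", "c", "d"});
  }
}

On the TODO's example ("a b c x" and "b c d" against input "a b c d") this
finds "b c d" without ever rescanning from the second token.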

Still, I would first try the obvious approach (use FST the way
SynFilter does) and see if it's fast enough.  I think Aho-Corasick
only really matters if your patterns have high overlap after shifting
(eg aaaab and aaaaab).
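
Concretely, for the out-of-band tagging you describe, the obvious approach is
roughly the sketch below (a HashSet of space-joined phrases stands in for the
FST; none of these names are real Lucene classes).  At each position it
greedily takes the longest matching phrase, reports its token positions, and
continues after it:

import java.util.HashSet;
import java.util.Set;

// Toy version of the obvious approach: at each start position, greedily take
// the longest matching phrase and report it out of band (start/end token
// positions) instead of injecting tokens.  A HashSet of space-joined phrases
// stands in for the FST; a real FST would also let you stop extending as soon
// as no phrase has the current prefix.
class GreedyKeywordTagger {

  interface TagSink {
    void tag(int startToken, int endTokenExclusive, String phrase);
  }

  private final Set<String> phrases;
  private final int maxPhraseTokens;

  GreedyKeywordTagger(Set<String> phrases, int maxPhraseTokens) {
    this.phrases = new HashSet<String>(phrases);
    this.maxPhraseTokens = maxPhraseTokens;
  }

  void tag(String[] tokens, TagSink sink) {
    int i = 0;
    while (i < tokens.length) {
      int bestEnd = -1;
      String bestPhrase = null;
      StringBuilder sb = new StringBuilder();
      // Try successively longer candidates starting at i; remember the longest hit.
      for (int j = i; j < tokens.length && j - i < maxPhraseTokens; j++) {
        if (j > i) {
          sb.append(' ');
        }
        sb.append(tokens[j]);
        String candidate = sb.toString();
        if (phrases.contains(candidate)) {
          bestEnd = j + 1;
          bestPhrase = candidate;
        }
      }
      if (bestPhrase != null) {
        sink.tag(i, bestEnd, bestPhrase);
        i = bestEnd;   // greedy: jump past the match, like SynFilter does
      } else {
        i++;           // nothing starts here; move to the next token
      }
    }
  }
}

With the real thing you'd feed it the terms coming off your analyzer and map
token positions back to character offsets via OffsetAttribute.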

Mike McCandless
