lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Smiley, David W." <dsmi...@mitre.org>
Subject SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 15:08:47 GMT
Hello.
  I'm embarking on developing code similar to the SynonymFilter but which merely needs to
record out of band to the analysis where there is matching text in the input tokens to the
corpus in the FST.  I'm calling this a "keyword tagger" in which I shove text through it and
when it's done it tells me at what offsets there is a match to a corpus of keyword & phrases,
and to what keywords/phrases they were exactly.  It doesn't have to inject or modify the token
stream because the results of this are going elsewhere.  Although, it would be a fine approach
to only omit the "tags" as I call them as a way of consuming the results, but I'm not indexing
them so it doesn't matter.

  I noticed the following TODOs at the start:

// TODO: maybe we should resolve token -> wordID then run
// FST on wordIDs, for better perf?

I intend on doing this since my matching keyword/phrases are often more than one word, and
I expect this will save memory and be faster.

// TODO: a more efficient approach would be Aho/Corasick's
// algorithm
// http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
// It improves over the current approach here
// because it does not fully re-start matching at every
// token.  For example if one pattern is "a b c x"
// and another is "b c d" and the input is "a b c d", on
// trying to parse "a b c x" but failing when you got to x,
// rather than starting over again your really should
// immediately recognize that "b c d" matches at the next
// input.  I suspect this won't matter that much in
// practice, but it's possible on some set of synonyms it
// will.  We'd have to modify Aho/Corasick to enforce our
// conflict resolving (eg greedy matching) because that algo
// finds all matches.  This really amounts to adding a .*
// closure to the FST and then determinizing it.

Could someone please clarify how the problem in the example above is to be fixed?  At the
end it states how to solve it, but I don't know how to do that and I'm not sure if there is
anything more to it since after all if it's as easy as that last sentence sounds then it would
have been done already ;-)

This code is intense!  I wish FSTs were better documented.  For example, there are no javadocs
on public members of FST.Arc like "output" and "nextFinalOutput" which are pertinent since
SynonymFilter directly accesses them.  IMO the state of FSTs is such that those that wrote
them know how they work (Robert, McCandless, Weiss) and seemingly everyone else like me doesn't
touch them because we don't know how.

~ David Smiley
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message