lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Smiley, David W." <dsmi...@mitre.org>
Subject Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 16:04:44 GMT

On Jul 12, 2012, at 11:51 AM, Dawid Weiss wrote:

Knowing how fsts work and being comfortable with the api that evolved through a series of
exploratory patches are two different things. I like my fsa api much better and there was
an effort to do something similar for lucene but i gave up at some point because the speed
of development killed me.

Do you mean it was slow to coordinate / get consensus or…?  Just curious.

Can you describe what youre trying to achieve in more detail? Ive used fsts for pattern matching
(sequences of arbitrary length) and my experience is that simple state trackers work wery
well (even if they may seem to do lots of spurious tracking).

I rather like Wikipedia's definition:  http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm

The number of names I want to handle is in the millions and so use of Lucene's FST is essential
as I see it.

~ David


On Jul 12, 2012 5:09 PM, "Smiley, David W." <dsmiley@mitre.org<mailto:dsmiley@mitre.org>>
wrote:
Hello.
  I'm embarking on developing code similar to the SynonymFilter but which merely needs to
record out of band to the analysis where there is matching text in the input tokens to the
corpus in the FST.  I'm calling this a "keyword tagger" in which I shove text through it and
when it's done it tells me at what offsets there is a match to a corpus of keyword & phrases,
and to what keywords/phrases they were exactly.  It doesn't have to inject or modify the token
stream because the results of this are going elsewhere.  Although, it would be a fine approach
to only omit the "tags" as I call them as a way of consuming the results, but I'm not indexing
them so it doesn't matter.

  I noticed the following TODOs at the start:

// TODO: maybe we should resolve token -> wordID then run
// FST on wordIDs, for better perf?

I intend on doing this since my matching keyword/phrases are often more than one word, and
I expect this will save memory and be faster.

// TODO: a more efficient approach would be Aho/Corasick's
// algorithm
// http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
// It improves over the current approach here
// because it does not fully re-start matching at every
// token.  For example if one pattern is "a b c x"
// and another is "b c d" and the input is "a b c d", on
// trying to parse "a b c x" but failing when you got to x,
// rather than starting over again your really should
// immediately recognize that "b c d" matches at the next
// input.  I suspect this won't matter that much in
// practice, but it's possible on some set of synonyms it
// will.  We'd have to modify Aho/Corasick to enforce our
// conflict resolving (eg greedy matching) because that algo
// finds all matches.  This really amounts to adding a .*
// closure to the FST and then determinizing it.

Could someone please clarify how the problem in the example above is to be fixed?  At the
end it states how to solve it, but I don't know how to do that and I'm not sure if there is
anything more to it since after all if it's as easy as that last sentence sounds then it would
have been done already ;-)

This code is intense!  I wish FSTs were better documented.  For example, there are no javadocs
on public members of FST.Arc like "output" and "nextFinalOutput" which are pertinent since
SynonymFilter directly accesses them.  IMO the state of FSTs is such that those that wrote
them know how they work (Robert, McCandless, Weiss) and seemingly everyone else like me doesn't
touch them because we don't know how.

~ David Smiley
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org<mailto:dev-unsubscribe@lucene.apache.org>
For additional commands, e-mail: dev-help@lucene.apache.org<mailto:dev-help@lucene.apache.org>



Mime
View raw message