lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 15:51:45 GMT
Knowing how fsts work and being comfortable with the api that evolved
through a series of exploratory patches are two different things. I like my
fsa api much better and there was an effort to do something similar for
lucene but i gave up at some point because the speed of development killed
me.

Can you describe what youre trying to achieve in more detail? Ive used fsts
for pattern matching (sequences of arbitrary length) and my experience is
that simple state trackers work wery well (even if they may seem to do lots
of spurious tracking).
On Jul 12, 2012 5:09 PM, "Smiley, David W." <dsmiley@mitre.org> wrote:

> Hello.
>   I'm embarking on developing code similar to the SynonymFilter but which
> merely needs to record out of band to the analysis where there is matching
> text in the input tokens to the corpus in the FST.  I'm calling this a
> "keyword tagger" in which I shove text through it and when it's done it
> tells me at what offsets there is a match to a corpus of keyword & phrases,
> and to what keywords/phrases they were exactly.  It doesn't have to inject
> or modify the token stream because the results of this are going elsewhere.
>  Although, it would be a fine approach to only omit the "tags" as I call
> them as a way of consuming the results, but I'm not indexing them so it
> doesn't matter.
>
>   I noticed the following TODOs at the start:
>
> // TODO: maybe we should resolve token -> wordID then run
> // FST on wordIDs, for better perf?
>
> I intend on doing this since my matching keyword/phrases are often more
> than one word, and I expect this will save memory and be faster.
>
> // TODO: a more efficient approach would be Aho/Corasick's
> // algorithm
> //
> http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
> // It improves over the current approach here
> // because it does not fully re-start matching at every
> // token.  For example if one pattern is "a b c x"
> // and another is "b c d" and the input is "a b c d", on
> // trying to parse "a b c x" but failing when you got to x,
> // rather than starting over again your really should
> // immediately recognize that "b c d" matches at the next
> // input.  I suspect this won't matter that much in
> // practice, but it's possible on some set of synonyms it
> // will.  We'd have to modify Aho/Corasick to enforce our
> // conflict resolving (eg greedy matching) because that algo
> // finds all matches.  This really amounts to adding a .*
> // closure to the FST and then determinizing it.
>
> Could someone please clarify how the problem in the example above is to be
> fixed?  At the end it states how to solve it, but I don't know how to do
> that and I'm not sure if there is anything more to it since after all if
> it's as easy as that last sentence sounds then it would have been done
> already ;-)
>
> This code is intense!  I wish FSTs were better documented.  For example,
> there are no javadocs on public members of FST.Arc like "output" and
> "nextFinalOutput" which are pertinent since SynonymFilter directly accesses
> them.  IMO the state of FSTs is such that those that wrote them know how
> they work (Robert, McCandless, Weiss) and seemingly everyone else like me
> doesn't touch them because we don't know how.
>
> ~ David Smiley
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Mime
View raw message