lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 20:14:41 GMT
This comment was probably made against Brics library FST
implementation because you can't really add a ".*" to the algorithm
that builds the FST incrementally is not possible because it accepts
fixed strings (and builds an already determinized automaton).

That's part of the reason my runtime solution was suboptimal -- the
tradeoff is that you can construct the FST very efficiently from
millions of input entries but don't need to manipulate it. Brics will
probably bail out with an OOM if you try to manipulate large graphs.

I'd gladly share my code because it was Lucene based... but I can't --
paid consulting job, sorry. Shouldn't be too hard to rewrite from
scratch though, really.

Dawid

On Thu, Jul 12, 2012 at 10:07 PM, Smiley, David W. <dsmiley@mitre.org> wrote:
> Thanks for your explanation, I already had a very rough idea of the
> approach.  Can Aho-Corasick be implemented with Lucene's FST?  Again, the
> SynonymFilterFactory said this RE Aho-Corasick:
> // This really amounts to adding a .*
> // closure to the FST and then determinizing it.
>
> You didn't mention FST once and that's the API I'm having trouble groking.
>
> ~ David
>
>
> On Jul 12, 2012, at 3:51 PM, Dawid Weiss [via Lucene] wrote:
>
> bq. I rather like Wikipedia's definition:
> http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm
>
> I did a similar thing but:
>
> 1) based on entire words as individual "tokens" (instead of
> letter-by-letter),
> 2) all words present in input patterns can be encoded as a separate
> data structure which maps to an unique integer
> 3) the matcher is essentially tracking the following:
>
> Match {
>   automaton_arc toNextNode;
>   final int matchStart;
>   int matchLength;
> }
>
> you then process the input word-by-word and advance each Match if
> there is an arc leaving "toNextNode" and matching the current word. If
> the toNextNode arc is final then you've hit a match and need to record
> it (it may not be the longest match so if you only care about the
> longest matches then additional processing is required).
>
> You create new Match objects and discard mismatched existing Matches
> as you process the input. Essentially, it's as if you tried to walk
> down in the automaton starting on every single position in the input.
> This may seem costly but in reality the matches are infrequent
> compared to the input text and they are rarely very, very long (to
> create lots of states). I used the above approach for entity matching
> and it worked super-fast.
>
> All this said, an Aho-Corasick transition graph would of course be
> more efficient. The question is how much more efficient and how much
> code/ work you'll need to put into it to make it work :)
>
> Dawid
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message