lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dawid Weiss <dawid.we...@gmail.com>
Subject Re: SynonymFilter, FST, and Aho-Corasick algorithm
Date Thu, 12 Jul 2012 16:10:15 GMT
The development was too fast for me to keep up. And by the time i had some
concept of the api mike wrote about million lines of code that would have
to be rewritten ;)

The current api isn't bad. Its fast.

I asked for an example of what you're trying to do because then i'd be able
to tell you if what i used would work. The number of entries does not
matter. I did use fsts but simple fsts nothing special.
On Jul 12, 2012 6:05 PM, "Smiley, David W." <dsmiley@mitre.org> wrote:

>
>  On Jul 12, 2012, at 11:51 AM, Dawid Weiss wrote:
>
> Knowing how fsts work and being comfortable with the api that evolved
> through a series of exploratory patches are two different things. I like my
> fsa api much better and there was an effort to do something similar for
> lucene but i gave up at some point because the speed of development killed
> me.
>
> Do you mean it was slow to coordinate / get consensus or…?  Just curious.
>
> Can you describe what youre trying to achieve in more detail? Ive used
> fsts for pattern matching (sequences of arbitrary length) and my experience
> is that simple state trackers work wery well (even if they may seem to do
> lots of spurious tracking).
>
> I rather like Wikipedia's definition:  http://en.wikipedia.org/wiki/Aho
> –Corasick_string_matching_algorithm
>
>  The number of names I want to handle is in the millions and so use of
> Lucene's FST is essential as I see it.
>
>  ~ David
>
>
>  On Jul 12, 2012 5:09 PM, "Smiley, David W." <dsmiley@mitre.org> wrote:
>
>> Hello.
>>   I'm embarking on developing code similar to the SynonymFilter but which
>> merely needs to record out of band to the analysis where there is matching
>> text in the input tokens to the corpus in the FST.  I'm calling this a
>> "keyword tagger" in which I shove text through it and when it's done it
>> tells me at what offsets there is a match to a corpus of keyword & phrases,
>> and to what keywords/phrases they were exactly.  It doesn't have to inject
>> or modify the token stream because the results of this are going elsewhere.
>>  Although, it would be a fine approach to only omit the "tags" as I call
>> them as a way of consuming the results, but I'm not indexing them so it
>> doesn't matter.
>>
>>   I noticed the following TODOs at the start:
>>
>> // TODO: maybe we should resolve token -> wordID then run
>> // FST on wordIDs, for better perf?
>>
>> I intend on doing this since my matching keyword/phrases are often more
>> than one word, and I expect this will save memory and be faster.
>>
>> // TODO: a more efficient approach would be Aho/Corasick's
>> // algorithm
>> //
>> http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
>> // It improves over the current approach here
>> // because it does not fully re-start matching at every
>> // token.  For example if one pattern is "a b c x"
>> // and another is "b c d" and the input is "a b c d", on
>> // trying to parse "a b c x" but failing when you got to x,
>> // rather than starting over again your really should
>> // immediately recognize that "b c d" matches at the next
>> // input.  I suspect this won't matter that much in
>> // practice, but it's possible on some set of synonyms it
>> // will.  We'd have to modify Aho/Corasick to enforce our
>> // conflict resolving (eg greedy matching) because that algo
>> // finds all matches.  This really amounts to adding a .*
>> // closure to the FST and then determinizing it.
>>
>> Could someone please clarify how the problem in the example above is to
>> be fixed?  At the end it states how to solve it, but I don't know how to do
>> that and I'm not sure if there is anything more to it since after all if
>> it's as easy as that last sentence sounds then it would have been done
>> already ;-)
>>
>> This code is intense!  I wish FSTs were better documented.  For example,
>> there are no javadocs on public members of FST.Arc like "output" and
>> "nextFinalOutput" which are pertinent since SynonymFilter directly accesses
>> them.  IMO the state of FSTs is such that those that wrote them know how
>> they work (Robert, McCandless, Weiss) and seemingly everyone else like me
>> doesn't touch them because we don't know how.
>>
>> ~ David Smiley
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>

Mime
View raw message