lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Finding a match for an automaton against a FST
Date Mon, 12 Jan 2015 13:45:12 GMT
On Sat, Jan 10, 2015 at 8:23 AM, Olivier Binda <olivier.binda@wanadoo.fr> wrote:
> On 01/10/2015 11:00 AM, Michael McCandless wrote:
>>
>> On Fri, Jan 9, 2015 at 6:42 AM, Olivier Binda <olivier.binda@wanadoo.fr>
>> wrote:
>>>
>>> Hello.
>>>
>>> 1) What is the best way to check if an automaton (from a regex or a
>>> string
>>> with a wildcard)
>>> has at least 1 match against a FST (from a WFSTCompletionLookup) ?
>>
>> You need to implement "intersect".  We already have this method for
>> two automata (Operations.java); maybe you can start from that but
>> cutover to the FST APIs instead for the 2nd automaton?
>
>
> I looked a bit into this. This is complicated stuff :/

Sorry, yes it is.  If you have any ideas to simplify the APIs that
would be awesome :)

> I think I get what the nested loops in intersect() do  : transitions consist
> of a double dimension array, somehow those arays are intersected.
> I don't understand yet why is there a .min and a .max for a transition (why
> not just a codepoint ?)

Most automaton transitions cover a wide range of unicode characters,
so requiring a separate transition for each would be too costly (too
many objects / RAM).

> Fst and automaton (and maybee the lucene codec stuff) are 3 different
> implementations
> of finite state machine/transducers, right ?

I think we have only 2 implementations (FST, Automaton).

> How does regexQuery (automaton) match against an index ?
> Does it use intersect() internally ?  (if it does, maybee I could reuse that
> code too)

RegexpQuery in core (NOT to be confused with the much slower,
differing in name by only one letter, RegexQuery in sandbox) builds an
Automaton and then uses Terms.intersect API.

However, I would not look for inspiration from Terms.intersect: that
implementation (in block tree terms dict) works with the terms
dictionary data structures to perform a fast intersection and that
code is crazy complex.

Possibly a place to look for inspiration/poaching is
FSTUtil.intersectPrefixPaths: that intersects an automaton with an
FST.  It's used by the fuzzy suggester...

Mike McCandless

http://blog.mikemccandless.com



>
>
>>
>>> 2) Also, is there a simple/efficient way to find the lowest and the
>>> highest
>>> arcs of a FST that match against  an automaton ?
>>
>> Hmm arcs leaving which state?  The initial state?  You could simply
>> walk all arcs leaving the initial state from the FST and check if the
>> automaton accepts them leaving its initial state (assuming the
>> automaton has no dead states)?
>>
>> Or, if you are already doing an intersection here, just save this
>> information as a side effect since you will have already computed it.
>
>
> thanks for the tips, it helps.
> Olivier
>
>
>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message