lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Date Wed, 15 Feb 2017 10:49:25 GMT
Actually, that's a great idea to try (Oliver).  It would be a
relatively simple conversion... maybe Lucene could add some sugar on
top, e.g. to convert an FST<Void> to an automaton.  Hmm, maybe it even
exists somewhere already...

But even the FST Builder's NodeHash can be non-trivial in its heap
usage, but hopefully less than DaciukMihovAutomatonBuilder.

(And yes I do love how simple DaciukMihovAutomatonBuilder is).

Mike McCandless

http://blog.mikemccandless.com


On Wed, Feb 15, 2017 at 5:39 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:
> Yep, true. I just wonder whether it's worth complicating the code...
> Could be easier to build an FST<Void> and then recreate a RunAutomaton
> from that directly... :)
>
> Dawid
>
> On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless
> <lucene@mikemccandless.com> wrote:
>> We may be able to make DaciukMihovAutomatonBuilder's state registry
>> more ram efficient too ... I think it's essentially the same thing as
>> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
>> automata.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>> You could try using morfologik's byte-based implementation:
>>>
>>> https://github.com/morfologik/morfologik-stemming/blob/master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/FSABuilderTest.java
>>>
>>> I can't guarantee it'll be fast enough -- you need to sort those input
>>> sequences and even this may take a while. The construction of the
>>> automaton after that is fairly fast. What are the time limits you have
>>> with respect to input data sizes? Perhaps it's just unrealistic to
>>> assume everything is performed as part of a single request?
>>>
>>> Dawid
>>>
>>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <o.mannion@gmail.com>
wrote:
>>>> Hi Mike,
>>>>
>>>> Thanks for the suggestion, I've tried Operations.run on a Automaton and
>>>> it's fast enough for my use case.
>>>>
>>>> However, the real problem I have is in building the Automaton
>>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes quite
a
>>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
>>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States I
>>>> think). I'm using this in an app that also serves realtime requests, which
>>>> timeout during building of the Automaton. It's an inherited application
>>>> design and not the best (ideally I'd be building the Automaton offline in
>>>> another process) and I don't expect it's a use case considered for
>>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
>>>> into other data structures for doing prefix matching. The dk.brics
>>>> Automaton has a similar performance profile, while the PatriciaTrie from
>>>> Apache Commons Collections seems to consume less CPU and produce less
>>>> garbage during build, although it has a less than ideal interface (it's a
>>>> Map).
>>>>
>>>> Any further suggestion though would be most welcome!
>>>>
>>>> Regards,
>>>>
>>>> Oliver
>>>>
>>>> On 14 February 2017 at 22:56, Michael McCandless <lucene@mikemccandless.com>
>>>> wrote:
>>>>
>>>>> Wow, 2G heap, that's horrible!
>>>>>
>>>>> How much heap does the automaton itself take?
>>>>>
>>>>> You can use the automaton's step method to transition from a state
>>>>> given the next input character to another state (or -1 if that state
>>>>> doesn't accept that character); it will be slower than the 2 GB run
>>>>> automaton, but perhaps fast enough?
>>>>>
>>>>> Mike McCandless
>>>>>
>>>>> http://blog.mikemccandless.com
>>>>>
>>>>>
>>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mannion@gmail.com>
>>>>> wrote:
>>>>> > Thanks Mike for getting back to me, sounds like I'm on the right
track.
>>>>> >
>>>>> > I'm building the automaton from around 1.7million strings, and it
ends up
>>>>> > with about 3.8million states and it turns out building a
>>>>> > CharacterRunAutomaton from that takes up about 2gig of heap (I was
quite
>>>>> > suprised!), with negligible performance difference at run time.
At your
>>>>> > suggestion I tried the ByteRunAutomaton and it was similar to the
>>>>> > CharacterRunAutomaton
>>>>> > in terms of heap and run time.  So for now I'm going to just stick
to an
>>>>> > Automaton.
>>>>> >
>>>>> > On 14 February 2017 at 00:41, Michael McCandless <
>>>>> lucene@mikemccandless.com>
>>>>> > wrote:
>>>>> >
>>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mannion@gmail.com>
>>>>> >> wrote:
>>>>> >>
>>>>> >> > I'd like to construct an Automaton to prefix match against
a large
>>>>> set of
>>>>> >> > strings. I gather a RunAutomation is immutable, thread
safe and faster
>>>>> >> than
>>>>> >> > Automaton.
>>>>> >>
>>>>> >> That's correct.
>>>>> >>
>>>>> >> > Are there any other differences between the three Automaton
>>>>> >> > classes, for example, in memory usage?
>>>>> >>
>>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both the
>>>>> >> original Automaton and the RunAutomaton, plus some other corner-casey
>>>>> >> things that are likely not interesting for your usage.
>>>>> >>
>>>>> >> > Would the general approach for such a problem be to use
>>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton from
the sorted
>>>>> list
>>>>> >> of
>>>>> >> > my string set, set all the states to accept (to enable
prefix
>>>>> matching),
>>>>> >> > then pass the Automaton into the constructor of a
>>>>> CharacterRunAutomaton,
>>>>> >> > and use the run method on the CharacterRunAutomaton to
match any
>>>>> queries?
>>>>> >>
>>>>> >> That sounds right.
>>>>> >>
>>>>> >> You could also try doing everything in UTF8 space instead, and
use
>>>>> >> ByteRunAutomaton: it will likely be faster since it will do
faster
>>>>> >> lookups on each transition.  It should still be safe to set
all states
>>>>> >> as accept, even though some of those states will be inside a
single
>>>>> >> Unicode character, as long as the strings you test against are
whole
>>>>> >> UTF-8 sequences?
>>>>> >>
>>>>> >> > As it seems like I'm building up the Automaton at least
three times,
>>>>> and
>>>>> >> > keeping a reference to the Automaton in the CharacterRunAutomaton,
is
>>>>> >> this
>>>>> >> > the most memory efficient way of building such an Automaton?
>>>>> >>
>>>>> >> Yeah, it is.  The RunAutomaton will likely require substantial
heap in
>>>>> >> your case, probably more than the original automaton.
>>>>> >>
>>>>> >> I suppose you don't actually need to keep the Automaton around
once
>>>>> >> the RunAutomaton is built, but Lucene doesn't make this possible
>>>>> >> today, since the RunAutomaton holds onto the Automaton...
>>>>> >>
>>>>> >> > Thanks in advance,
>>>>> >>
>>>>> >> You're welcome!
>>>>> >>
>>>>> >> Mike McCandless
>>>>> >>
>>>>> >> http://blog.mikemccandless.com
>>>>> >>
>>>>> >> ---------------------------------------------------------------------
>>>>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>> >>
>>>>> >>
>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message