lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oliver Mannion <o.mann...@gmail.com>
Subject Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Date Sun, 19 Feb 2017 23:34:40 GMT
Hi David & Mike,

Thanks for the suggestions!

Regarding my requirements for building the FSA, it happens in a background
process, not as part of a request, although the app is still serving
requests at the same time. So the time taken to build isn't really the
issue, it's just that it shouldn't starve the other requests of CPU.

I've tried building an FST<Object> with NoOutputs, and morfologik's FSA,
and both are much better in terms of build time, CPU and RAM usage compared
to Automaton. They are similar to, and in some dimensions even better than,
the PatriciaTrie. In particular building an FST with doShareSuffix = false
is the fastest of any option, and morfologik's fsa has the lowest heap
usage of all (both in terms of garbage and the final size of the FSA).
Either will meet my requirements. I quite like the API provided by
morfologik (no need to work with BytesRef) and it supports both prefix
(MatchResult.SEQUENCE_IS_A_PREFIX) and exact matching
(MatchResult.EXACT_MATCH).

Given the efficiencies and functionality similarity of the FST vs
Automation in Lucene, I'm kind of curious as to why you would ever use
Automaton?

Thanks very much for your help!
Oliver

On 15 February 2017 at 21:49, Michael McCandless <lucene@mikemccandless.com>
wrote:

> Actually, that's a great idea to try (Oliver).  It would be a
> relatively simple conversion... maybe Lucene could add some sugar on
> top, e.g. to convert an FST<Void> to an automaton.  Hmm, maybe it even
> exists somewhere already...
>
> But even the FST Builder's NodeHash can be non-trivial in its heap
> usage, but hopefully less than DaciukMihovAutomatonBuilder.
>
> (And yes I do love how simple DaciukMihovAutomatonBuilder is).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 15, 2017 at 5:39 AM, Dawid Weiss <dawid.weiss@gmail.com>
> wrote:
> > Yep, true. I just wonder whether it's worth complicating the code...
> > Could be easier to build an FST<Void> and then recreate a RunAutomaton
> > from that directly... :)
> >
> > Dawid
> >
> > On Wed, Feb 15, 2017 at 11:26 AM, Michael McCandless
> > <lucene@mikemccandless.com> wrote:
> >> We may be able to make DaciukMihovAutomatonBuilder's state registry
> >> more ram efficient too ... I think it's essentially the same thing as
> >> the FST.Builder's NodeHash, just minus the outputs that FSTs have vs
> >> automata.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Wed, Feb 15, 2017 at 5:14 AM, Dawid Weiss <dawid.weiss@gmail.com>
> wrote:
> >>> You could try using morfologik's byte-based implementation:
> >>>
> >>> https://github.com/morfologik/morfologik-stemming/blob/
> master/morfologik-fsa-builders/src/test/java/morfologik/fsa/builders/
> FSABuilderTest.java
> >>>
> >>> I can't guarantee it'll be fast enough -- you need to sort those input
> >>> sequences and even this may take a while. The construction of the
> >>> automaton after that is fairly fast. What are the time limits you have
> >>> with respect to input data sizes? Perhaps it's just unrealistic to
> >>> assume everything is performed as part of a single request?
> >>>
> >>> Dawid
> >>>
> >>> On Wed, Feb 15, 2017 at 11:05 AM, Oliver Mannion <o.mannion@gmail.com>
> wrote:
> >>>> Hi Mike,
> >>>>
> >>>> Thanks for the suggestion, I've tried Operations.run on a Automaton
> and
> >>>> it's fast enough for my use case.
> >>>>
> >>>> However, the real problem I have is in building the Automaton
> >>>> via DaciukMihovAutomatonBuilder. On my input string set it consumes
> quite a
> >>>> bit of CPU, a lot of which seems to be GC activity, cleaning up about
> 800mb
> >>>> of garbage produced during build (DaciukMihovAutomatonBuilder.States
> I
> >>>> think). I'm using this in an app that also serves realtime requests,
> which
> >>>> timeout during building of the Automaton. It's an inherited
> application
> >>>> design and not the best (ideally I'd be building the Automaton
> offline in
> >>>> another process) and I don't expect it's a use case considered for
> >>>> DaciukMihovAutomatonBuilder. Unfortunately it means I've started
> looking
> >>>> into other data structures for doing prefix matching. The dk.brics
> >>>> Automaton has a similar performance profile, while the PatriciaTrie
> from
> >>>> Apache Commons Collections seems to consume less CPU and produce less
> >>>> garbage during build, although it has a less than ideal interface
> (it's a
> >>>> Map).
> >>>>
> >>>> Any further suggestion though would be most welcome!
> >>>>
> >>>> Regards,
> >>>>
> >>>> Oliver
> >>>>
> >>>> On 14 February 2017 at 22:56, Michael McCandless <
> lucene@mikemccandless.com>
> >>>> wrote:
> >>>>
> >>>>> Wow, 2G heap, that's horrible!
> >>>>>
> >>>>> How much heap does the automaton itself take?
> >>>>>
> >>>>> You can use the automaton's step method to transition from a state
> >>>>> given the next input character to another state (or -1 if that state
> >>>>> doesn't accept that character); it will be slower than the 2 GB
run
> >>>>> automaton, but perhaps fast enough?
> >>>>>
> >>>>> Mike McCandless
> >>>>>
> >>>>> http://blog.mikemccandless.com
> >>>>>
> >>>>>
> >>>>> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mannion@gmail.com
> >
> >>>>> wrote:
> >>>>> > Thanks Mike for getting back to me, sounds like I'm on the
right
> track.
> >>>>> >
> >>>>> > I'm building the automaton from around 1.7million strings,
and it
> ends up
> >>>>> > with about 3.8million states and it turns out building a
> >>>>> > CharacterRunAutomaton from that takes up about 2gig of heap
(I was
> quite
> >>>>> > suprised!), with negligible performance difference at run time.
At
> your
> >>>>> > suggestion I tried the ByteRunAutomaton and it was similar
to the
> >>>>> > CharacterRunAutomaton
> >>>>> > in terms of heap and run time.  So for now I'm going to just
stick
> to an
> >>>>> > Automaton.
> >>>>> >
> >>>>> > On 14 February 2017 at 00:41, Michael McCandless <
> >>>>> lucene@mikemccandless.com>
> >>>>> > wrote:
> >>>>> >
> >>>>> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <
> o.mannion@gmail.com>
> >>>>> >> wrote:
> >>>>> >>
> >>>>> >> > I'd like to construct an Automaton to prefix match
against a
> large
> >>>>> set of
> >>>>> >> > strings. I gather a RunAutomation is immutable, thread
safe and
> faster
> >>>>> >> than
> >>>>> >> > Automaton.
> >>>>> >>
> >>>>> >> That's correct.
> >>>>> >>
> >>>>> >> > Are there any other differences between the three
Automaton
> >>>>> >> > classes, for example, in memory usage?
> >>>>> >>
> >>>>> >> CompiledAutomaton is just a thin wrapper to hold onto both
the
> >>>>> >> original Automaton and the RunAutomaton, plus some other
> corner-casey
> >>>>> >> things that are likely not interesting for your usage.
> >>>>> >>
> >>>>> >> > Would the general approach for such a problem be to
use
> >>>>> >> > DaciukMihovAutomatonBuilder to create an Automaton
from the
> sorted
> >>>>> list
> >>>>> >> of
> >>>>> >> > my string set, set all the states to accept (to enable
prefix
> >>>>> matching),
> >>>>> >> > then pass the Automaton into the constructor of a
> >>>>> CharacterRunAutomaton,
> >>>>> >> > and use the run method on the CharacterRunAutomaton
to match any
> >>>>> queries?
> >>>>> >>
> >>>>> >> That sounds right.
> >>>>> >>
> >>>>> >> You could also try doing everything in UTF8 space instead,
and use
> >>>>> >> ByteRunAutomaton: it will likely be faster since it will
do faster
> >>>>> >> lookups on each transition.  It should still be safe to
set all
> states
> >>>>> >> as accept, even though some of those states will be inside
a
> single
> >>>>> >> Unicode character, as long as the strings you test against
are
> whole
> >>>>> >> UTF-8 sequences?
> >>>>> >>
> >>>>> >> > As it seems like I'm building up the Automaton at
least three
> times,
> >>>>> and
> >>>>> >> > keeping a reference to the Automaton in the
> CharacterRunAutomaton, is
> >>>>> >> this
> >>>>> >> > the most memory efficient way of building such an
Automaton?
> >>>>> >>
> >>>>> >> Yeah, it is.  The RunAutomaton will likely require substantial
> heap in
> >>>>> >> your case, probably more than the original automaton.
> >>>>> >>
> >>>>> >> I suppose you don't actually need to keep the Automaton
around
> once
> >>>>> >> the RunAutomaton is built, but Lucene doesn't make this
possible
> >>>>> >> today, since the RunAutomaton holds onto the Automaton...
> >>>>> >>
> >>>>> >> > Thanks in advance,
> >>>>> >>
> >>>>> >> You're welcome!
> >>>>> >>
> >>>>> >> Mike McCandless
> >>>>> >>
> >>>>> >> http://blog.mikemccandless.com
> >>>>> >>
> >>>>> >> ------------------------------------------------------------
> ---------
> >>>>> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>>> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>>>> >>
> >>>>> >>
> >>>>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message