lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oliver Mannion <o.mann...@gmail.com>
Subject Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Date Wed, 15 Feb 2017 10:05:16 GMT
Hi Mike,

Thanks for the suggestion, I've tried Operations.run on a Automaton and
it's fast enough for my use case.

However, the real problem I have is in building the Automaton
via DaciukMihovAutomatonBuilder. On my input string set it consumes quite a
bit of CPU, a lot of which seems to be GC activity, cleaning up about 800mb
of garbage produced during build (DaciukMihovAutomatonBuilder.States I
think). I'm using this in an app that also serves realtime requests, which
timeout during building of the Automaton. It's an inherited application
design and not the best (ideally I'd be building the Automaton offline in
another process) and I don't expect it's a use case considered for
DaciukMihovAutomatonBuilder. Unfortunately it means I've started looking
into other data structures for doing prefix matching. The dk.brics
Automaton has a similar performance profile, while the PatriciaTrie from
Apache Commons Collections seems to consume less CPU and produce less
garbage during build, although it has a less than ideal interface (it's a
Map).

Any further suggestion though would be most welcome!

Regards,

Oliver

On 14 February 2017 at 22:56, Michael McCandless <lucene@mikemccandless.com>
wrote:

> Wow, 2G heap, that's horrible!
>
> How much heap does the automaton itself take?
>
> You can use the automaton's step method to transition from a state
> given the next input character to another state (or -1 if that state
> doesn't accept that character); it will be slower than the 2 GB run
> automaton, but perhaps fast enough?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mannion@gmail.com>
> wrote:
> > Thanks Mike for getting back to me, sounds like I'm on the right track.
> >
> > I'm building the automaton from around 1.7million strings, and it ends up
> > with about 3.8million states and it turns out building a
> > CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
> > suprised!), with negligible performance difference at run time. At your
> > suggestion I tried the ByteRunAutomaton and it was similar to the
> > CharacterRunAutomaton
> > in terms of heap and run time.  So for now I'm going to just stick to an
> > Automaton.
> >
> > On 14 February 2017 at 00:41, Michael McCandless <
> lucene@mikemccandless.com>
> > wrote:
> >
> >> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mannion@gmail.com>
> >> wrote:
> >>
> >> > I'd like to construct an Automaton to prefix match against a large
> set of
> >> > strings. I gather a RunAutomation is immutable, thread safe and faster
> >> than
> >> > Automaton.
> >>
> >> That's correct.
> >>
> >> > Are there any other differences between the three Automaton
> >> > classes, for example, in memory usage?
> >>
> >> CompiledAutomaton is just a thin wrapper to hold onto both the
> >> original Automaton and the RunAutomaton, plus some other corner-casey
> >> things that are likely not interesting for your usage.
> >>
> >> > Would the general approach for such a problem be to use
> >> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted
> list
> >> of
> >> > my string set, set all the states to accept (to enable prefix
> matching),
> >> > then pass the Automaton into the constructor of a
> CharacterRunAutomaton,
> >> > and use the run method on the CharacterRunAutomaton to match any
> queries?
> >>
> >> That sounds right.
> >>
> >> You could also try doing everything in UTF8 space instead, and use
> >> ByteRunAutomaton: it will likely be faster since it will do faster
> >> lookups on each transition.  It should still be safe to set all states
> >> as accept, even though some of those states will be inside a single
> >> Unicode character, as long as the strings you test against are whole
> >> UTF-8 sequences?
> >>
> >> > As it seems like I'm building up the Automaton at least three times,
> and
> >> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
> >> this
> >> > the most memory efficient way of building such an Automaton?
> >>
> >> Yeah, it is.  The RunAutomaton will likely require substantial heap in
> >> your case, probably more than the original automaton.
> >>
> >> I suppose you don't actually need to keep the Automaton around once
> >> the RunAutomaton is built, but Lucene doesn't make this possible
> >> today, since the RunAutomaton holds onto the Automaton...
> >>
> >> > Thanks in advance,
> >>
> >> You're welcome!
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message