lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Date Tue, 14 Feb 2017 11:56:52 GMT
Wow, 2G heap, that's horrible!

How much heap does the automaton itself take?

You can use the automaton's step method to transition from a state
given the next input character to another state (or -1 if that state
doesn't accept that character); it will be slower than the 2 GB run
automaton, but perhaps fast enough?

Mike McCandless

On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <> wrote:
> Thanks Mike for getting back to me, sounds like I'm on the right track.
> I'm building the automaton from around 1.7million strings, and it ends up
> with about 3.8million states and it turns out building a
> CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
> suprised!), with negligible performance difference at run time. At your
> suggestion I tried the ByteRunAutomaton and it was similar to the
> CharacterRunAutomaton
> in terms of heap and run time.  So for now I'm going to just stick to an
> Automaton.
> On 14 February 2017 at 00:41, Michael McCandless <>
> wrote:
>> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <>
>> wrote:
>> > I'd like to construct an Automaton to prefix match against a large set of
>> > strings. I gather a RunAutomation is immutable, thread safe and faster
>> than
>> > Automaton.
>> That's correct.
>> > Are there any other differences between the three Automaton
>> > classes, for example, in memory usage?
>> CompiledAutomaton is just a thin wrapper to hold onto both the
>> original Automaton and the RunAutomaton, plus some other corner-casey
>> things that are likely not interesting for your usage.
>> > Would the general approach for such a problem be to use
>> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list
>> of
>> > my string set, set all the states to accept (to enable prefix matching),
>> > then pass the Automaton into the constructor of a CharacterRunAutomaton,
>> > and use the run method on the CharacterRunAutomaton to match any queries?
>> That sounds right.
>> You could also try doing everything in UTF8 space instead, and use
>> ByteRunAutomaton: it will likely be faster since it will do faster
>> lookups on each transition.  It should still be safe to set all states
>> as accept, even though some of those states will be inside a single
>> Unicode character, as long as the strings you test against are whole
>> UTF-8 sequences?
>> > As it seems like I'm building up the Automaton at least three times, and
>> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>> this
>> > the most memory efficient way of building such an Automaton?
>> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>> your case, probably more than the original automaton.
>> I suppose you don't actually need to keep the Automaton around once
>> the RunAutomaton is built, but Lucene doesn't make this possible
>> today, since the RunAutomaton holds onto the Automaton...
>> > Thanks in advance,
>> You're welcome!
>> Mike McCandless
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message