lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Date Tue, 14 Feb 2017 11:56:52 GMT
Wow, 2G heap, that's horrible!

How much heap does the automaton itself take?

You can use the automaton's step method to transition from a state
given the next input character to another state (or -1 if that state
doesn't accept that character); it will be slower than the 2 GB run
automaton, but perhaps fast enough?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Feb 14, 2017 at 6:50 AM, Oliver Mannion <o.mannion@gmail.com> wrote:
> Thanks Mike for getting back to me, sounds like I'm on the right track.
>
> I'm building the automaton from around 1.7million strings, and it ends up
> with about 3.8million states and it turns out building a
> CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
> suprised!), with negligible performance difference at run time. At your
> suggestion I tried the ByteRunAutomaton and it was similar to the
> CharacterRunAutomaton
> in terms of heap and run time.  So for now I'm going to just stick to an
> Automaton.
>
> On 14 February 2017 at 00:41, Michael McCandless <lucene@mikemccandless.com>
> wrote:
>
>> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mannion@gmail.com>
>> wrote:
>>
>> > I'd like to construct an Automaton to prefix match against a large set of
>> > strings. I gather a RunAutomation is immutable, thread safe and faster
>> than
>> > Automaton.
>>
>> That's correct.
>>
>> > Are there any other differences between the three Automaton
>> > classes, for example, in memory usage?
>>
>> CompiledAutomaton is just a thin wrapper to hold onto both the
>> original Automaton and the RunAutomaton, plus some other corner-casey
>> things that are likely not interesting for your usage.
>>
>> > Would the general approach for such a problem be to use
>> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list
>> of
>> > my string set, set all the states to accept (to enable prefix matching),
>> > then pass the Automaton into the constructor of a CharacterRunAutomaton,
>> > and use the run method on the CharacterRunAutomaton to match any queries?
>>
>> That sounds right.
>>
>> You could also try doing everything in UTF8 space instead, and use
>> ByteRunAutomaton: it will likely be faster since it will do faster
>> lookups on each transition.  It should still be safe to set all states
>> as accept, even though some of those states will be inside a single
>> Unicode character, as long as the strings you test against are whole
>> UTF-8 sequences?
>>
>> > As it seems like I'm building up the Automaton at least three times, and
>> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
>> this
>> > the most memory efficient way of building such an Automaton?
>>
>> Yeah, it is.  The RunAutomaton will likely require substantial heap in
>> your case, probably more than the original automaton.
>>
>> I suppose you don't actually need to keep the Automaton around once
>> the RunAutomaton is built, but Lucene doesn't make this possible
>> today, since the RunAutomaton holds onto the Automaton...
>>
>> > Thanks in advance,
>>
>> You're welcome!
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message