lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Oliver Mannion <o.mann...@gmail.com>
Subject Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Date Tue, 14 Feb 2017 11:50:02 GMT
Thanks Mike for getting back to me, sounds like I'm on the right track.

I'm building the automaton from around 1.7million strings, and it ends up
with about 3.8million states and it turns out building a
CharacterRunAutomaton from that takes up about 2gig of heap (I was quite
suprised!), with negligible performance difference at run time. At your
suggestion I tried the ByteRunAutomaton and it was similar to the
CharacterRunAutomaton
in terms of heap and run time.  So for now I'm going to just stick to an
Automaton.

On 14 February 2017 at 00:41, Michael McCandless <lucene@mikemccandless.com>
wrote:

> On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mannion@gmail.com>
> wrote:
>
> > I'd like to construct an Automaton to prefix match against a large set of
> > strings. I gather a RunAutomation is immutable, thread safe and faster
> than
> > Automaton.
>
> That's correct.
>
> > Are there any other differences between the three Automaton
> > classes, for example, in memory usage?
>
> CompiledAutomaton is just a thin wrapper to hold onto both the
> original Automaton and the RunAutomaton, plus some other corner-casey
> things that are likely not interesting for your usage.
>
> > Would the general approach for such a problem be to use
> > DaciukMihovAutomatonBuilder to create an Automaton from the sorted list
> of
> > my string set, set all the states to accept (to enable prefix matching),
> > then pass the Automaton into the constructor of a CharacterRunAutomaton,
> > and use the run method on the CharacterRunAutomaton to match any queries?
>
> That sounds right.
>
> You could also try doing everything in UTF8 space instead, and use
> ByteRunAutomaton: it will likely be faster since it will do faster
> lookups on each transition.  It should still be safe to set all states
> as accept, even though some of those states will be inside a single
> Unicode character, as long as the strings you test against are whole
> UTF-8 sequences?
>
> > As it seems like I'm building up the Automaton at least three times, and
> > keeping a reference to the Automaton in the CharacterRunAutomaton, is
> this
> > the most memory efficient way of building such an Automaton?
>
> Yeah, it is.  The RunAutomaton will likely require substantial heap in
> your case, probably more than the original automaton.
>
> I suppose you don't actually need to keep the Automaton around once
> the RunAutomaton is built, but Lucene doesn't make this possible
> today, since the RunAutomaton holds onto the Automaton...
>
> > Thanks in advance,
>
> You're welcome!
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message