lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Building an automaton efficiently (CompiledAutomaton vs RunAutomaton vs Automaton)
Date Mon, 13 Feb 2017 13:41:02 GMT
On Mon, Feb 13, 2017 at 6:39 AM, Oliver Mannion <o.mannion@gmail.com> wrote:

> I'd like to construct an Automaton to prefix match against a large set of
> strings. I gather a RunAutomation is immutable, thread safe and faster than
> Automaton.

That's correct.

> Are there any other differences between the three Automaton
> classes, for example, in memory usage?

CompiledAutomaton is just a thin wrapper to hold onto both the
original Automaton and the RunAutomaton, plus some other corner-casey
things that are likely not interesting for your usage.

> Would the general approach for such a problem be to use
> DaciukMihovAutomatonBuilder to create an Automaton from the sorted list of
> my string set, set all the states to accept (to enable prefix matching),
> then pass the Automaton into the constructor of a CharacterRunAutomaton,
> and use the run method on the CharacterRunAutomaton to match any queries?

That sounds right.

You could also try doing everything in UTF8 space instead, and use
ByteRunAutomaton: it will likely be faster since it will do faster
lookups on each transition.  It should still be safe to set all states
as accept, even though some of those states will be inside a single
Unicode character, as long as the strings you test against are whole
UTF-8 sequences?

> As it seems like I'm building up the Automaton at least three times, and
> keeping a reference to the Automaton in the CharacterRunAutomaton, is this
> the most memory efficient way of building such an Automaton?

Yeah, it is.  The RunAutomaton will likely require substantial heap in
your case, probably more than the original automaton.

I suppose you don't actually need to keep the Automaton around once
the RunAutomaton is built, but Lucene doesn't make this possible
today, since the RunAutomaton holds onto the Automaton...

> Thanks in advance,

You're welcome!

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message