From Robert Muir <>
Subject Re: new AutomatonQuery(RunAutomaton) ?
Date Wed, 31 Aug 2011 17:44:16 GMT
On Wed, Aug 31, 2011 at 1:30 PM, eks dev <> wrote:
> I do not think it will be expensive, it is just an attempt to keep
> code smaller, simpler and marginally faster :)

I think you will find the compile is pretty fast, and it only happens
once per query too (it's not per-segment or anything)... see below

> those are a lot (Ca 1000) of small prefix based regex-es with limited
> alphabet compiled as RunAutomaton I load on startup and lookup from
> some RunAutomaton[] on request...
> they look like Regex("((123)|(124)|(401)|(777)|(351))[0-9]{0,2}")
> By the way, what will AutomatonQuery prefer "(XXX)[0-9]{0,2}" or
> "(XXX)[0-9]*" or "(XXX).*" ? Any performance difference?

Well, you would have to benchmark, and it definitely depends on your content.
(XXX)[0-9]{0,2} is the 'simplest' automaton in that it's finite; if you
actually have (XXX)[0-9][0-9]<junk> terms, it will seek past them.

The other two forms you listed are infinite, and when AutomatonQuery
finds a 'loop' in the automaton, it temporarily turns itself into a
'filtering rangequery', with the upper bound being the end of the loop.
This prevents it from doing a lot of useless disk seeks.

If you have (XXX)[0-9]* this is going to seek to (XXX) and then act as
a range query to (XXX)a (exclusive, just indicating a is the first
valid term after the infinitely long pattern).
Then for each term in the range query it's going to 'check' that it
matches the automaton.
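
A rough sketch of that "seek, then filter a range" behaviour, using only the JDK. This is not Lucene code: a sorted `TreeSet` stands in for the term dictionary, `java.util.regex.Pattern` stands in for the automaton, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.regex.Pattern;

// Hypothetical illustration only; not how Lucene implements it internally.
class FilteringRangeSketch {
    /**
     * Positions once at lowerInclusive, walks the sorted terms up to
     * upperExclusive, and keeps only the terms the pattern fully accepts.
     */
    public static List<String> scan(NavigableSet<String> terms,
                                    String lowerInclusive,
                                    String upperExclusive,
                                    Pattern pattern) {
        List<String> hits = new ArrayList<>();
        // subSet gives a sorted view of the range: one "seek", then a sequential scan
        for (String term : terms.subSet(lowerInclusive, true, upperExclusive, false)) {
            // the 'check' step: each term in the range is tested against the pattern
            if (pattern.matcher(term).matches()) {
                hits.add(term);
            }
        }
        return hits;
    }
}
```

With terms 122, 123, 12345, 1239x, 123a and 124, scanning from "123" to "123a" with the pattern 123[0-9]* visits 123, 12345 and 1239x, and keeps only the first two.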

(XXX).* will be similar to the above, except it's obviously going to
accept more terms, e.g. (XXX)m, and its 'range query' will be
something like (XXX)->(XXY)
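
To make the difference in accepted terms concrete, here is a small comparison of the three forms. It assumes `java.util.regex` as a stand-in, since these particular patterns mean the same thing there as in the automaton regex syntax; the prefix 123 and the class name are just examples.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; java.util.regex stands in for the automaton.
class PatternFormsSketch {
    static final Pattern BOUNDED = Pattern.compile("(123)[0-9]{0,2}");
    static final Pattern DIGIT_LOOP = Pattern.compile("(123)[0-9]*");
    static final Pattern ANY_LOOP = Pattern.compile("(123).*");

    /** Returns whether the term is accepted by each form, in the order above. */
    public static boolean[] accepts(String term) {
        return new boolean[] {
            BOUNDED.matcher(term).matches(),
            DIGIT_LOOP.matcher(term).matches(),
            ANY_LOOP.matcher(term).matches()
        };
    }
}
```

For a 5-digit term such as 12345 all three forms agree. They only differ on longer terms (123456 is rejected by the bounded form) and on non-digit suffixes (123m is accepted only by the .* form).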

> Semantically are they the same as I know that my content is only 5 digits
> I need them to
> 1. formulate complex BooleanQuery, where AutomatonQuery gets one clause
> 2. do post processing (a lot of hits) of the "query against hits" and
> this has to be fast.
> I guess, I will switch to keeping only Automaton[] and build
> RunAutomaton on the fly (per request) for fast query vs hits, this is
> done once per request only, but then I need to keep state of the
> RunAutomaton per query... makes things slightly more verbose...

AutomatonQuery computes this stuff a single time, up-front in its
constructor. Can you just reuse the AutomatonQuery(s) in your app?
This should work fine.
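
The general "compile once, reuse per request" idea can be sketched like this with only the JDK. It is an assumption-laden stand-in: `java.util.regex.Pattern` plays the role of the precompiled query, and the cache class and its method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

// Hypothetical illustration of compiling each expression once and reusing it.
class CompiledOnceSketch {
    private final Map<String, Pattern> cache = new ConcurrentHashMap<>();
    private final AtomicInteger compiles = new AtomicInteger();

    /** Returns the compiled form, compiling only on the first request for a regex. */
    public Pattern get(String regex) {
        return cache.computeIfAbsent(regex, r -> {
            compiles.incrementAndGet(); // counts how often compilation really runs
            return Pattern.compile(r);
        });
    }

    /** Matches a term against the cached compiled form of the regex. */
    public boolean matches(String regex, String term) {
        return get(regex).matcher(term).matches();
    }

    /** Number of compilations performed so far. */
    public int compileCount() {
        return compiles.get();
    }
}
```

Repeated lookups of the same expression hit the cache, so the up-front cost is paid once per expression rather than once per request.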


