lucene-dev mailing list archives
From eks dev <>
Subject Re: new AutomatonQuery(RunAutomaton) ?
Date Wed, 31 Aug 2011 18:37:51 GMT
In my case, knowing the content of the term dictionary for this field
(no junk), "(XXX).*" is probably the best option, forcing skipping
mode... but this part is not an issue. If it ever gets slow, I could
index an additional field with a common-denominator vector of all XXX, but
thanks to AutomatonQuery I do not need to fiddle with indexing and
ugly term expansion/range queries.

Keeping the AutomatonQuery around occurred to me as an option, but do not
forget, I need the Automaton (RunAutomaton) for post-processing... Is there
no way to get the Automaton back from the AutomatonQuery?
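
The post-processing step discussed in this thread (a table of regexes compiled once at startup, then looked up per request and run against hit values) can be sketched roughly as follows. This is only an illustration: it uses java.util.regex.Pattern as a stand-in for Lucene's RunAutomaton, which is a backtracking matcher rather than a table-driven one, so it says nothing about relative speed. The class name PostFilterSketch, the method filterHits, and the idea of addressing rules by an integer id are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Stand-in for a RunAutomaton[] table: patterns compiled once, looked up per request. */
public class PostFilterSketch {

    // Compiled once at startup; the array index plays the role of a (hypothetical) rule id.
    private final Pattern[] rules;

    public PostFilterSketch(String... regexes) {
        rules = new Pattern[regexes.length];
        for (int i = 0; i < regexes.length; i++) {
            rules[i] = Pattern.compile(regexes[i]);
        }
    }

    /** Keeps only the hit values that the selected rule accepts as a whole string. */
    public List<String> filterHits(int ruleId, List<String> hitValues) {
        Pattern rule = rules[ruleId];
        List<String> kept = new ArrayList<>();
        for (String value : hitValues) {
            if (rule.matcher(value).matches()) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The regex is the example quoted in this thread.
        PostFilterSketch table = new PostFilterSketch(
                "((123)|(124)|(401)|(777)|(351))[0-9]{0,2}");
        System.out.println(table.filterHits(0,
                List.of("12345", "40199", "99999", "1234567")));
    }
}
```

Because the compiled patterns are immutable, one table of this kind can be shared across requests; only the per-call matcher carries state.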

Anyhow, I am impressed by how some complex things become simple with
AutomatonQuery. For this type of query, I used to need a really
obscure combination of term expansion and, where possible, range queries
... practically simulating regular expressions by hand.

Automaton is a powerful weapon in the Lucene arsenal; it will nuke a lot
of "black magic, hard to read" code in my code base. I have not yet started
combining Lev1/2 with the regex automaton, but I will. It will fix
another of my "do it outside of Lucene" problems.

On Wed, Aug 31, 2011 at 7:44 PM, Robert Muir <> wrote:
> On Wed, Aug 31, 2011 at 1:30 PM, eks dev <> wrote:
>> I do not think it will be expensive, it is just an attempt to keep
>> code smaller, simpler and marginally faster :)
> I think you will find the compile is pretty fast, and this only happens
> once per query too (it's not per-segment or anything)... see below
>> Those are a lot (ca. 1000) of small prefix-based regexes with a limited
>> alphabet, compiled as RunAutomaton, which I load on startup and look up
>> from some RunAutomaton[] on request...
>> They look like Regex("((123)|(124)|(401)|(777)|(351))[0-9]{0,2}")
>> By the way, what will AutomatonQuery prefer "(XXX)[0-9]{0,2}" or
>> "(XXX)[0-9]*" or "(XXX).*" ? Any performance difference?
> Well, you would have to benchmark, and it definitely depends on your content.
> (XXX)[0-9]{0,2} is the 'simplest' automaton in that it's finite; if you
> actually have (XXX)[0-9][0-9]<junk> it will seek past that.
> The other two forms you listed are infinite, and when AutomatonQuery
> finds a 'loop' in the automaton, it turns itself into a 'filtering
> range query' temporarily, with the upper bound being the end of the
> transition.
> This prevents it from doing a lot of useless disk seeks.
> If you have (XXX)[0-9]*, this is going to seek to (XXX) and then act as
> a range query up to (XXX)a (exclusive; 'a' just indicates the first
> valid term after the infinitely long pattern
> (XXX)999999999999999999999......).
> Then, for each term in the range query, it's going to 'check' that it
> matches the automaton.
> (XXX).* will be similar to the above, except it's obviously going to
> accept more terms, e.g. (XXX)m, and its 'range query' will be
> something like (XXX)->(XXY).
>> Semantically they are the same, as I know that my content is only 5 digits.
>> I need them to:
>> 1. formulate complex BooleanQuery, where AutomatonQuery gets one clause
>> 2. do post processing (a lot of hits) of the "query against hits" and
>> this has to be fast.
>> I guess I will switch to keeping only Automaton[] and build the
>> RunAutomaton on the fly (per request) for fast query vs hits; this is
>> done once per request only, but then I need to keep the state of the
>> RunAutomaton per query... makes things slightly more verbose...
> AutomatonQuery computes this stuff a single time, up-front in its
> constructor. Can you just reuse the AutomatonQuery(s) in your app?
> This should work fine.
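
The "seek, then act as a filtering range query" behaviour described in the quoted reply can be modelled on a toy term dictionary. The sketch below is not Lucene's implementation: it uses a sorted java.util.TreeSet in place of the term dictionary, java.util.regex.Pattern in place of the automaton, and Java's UTF-16 string order in place of Lucene's term order (the two agree for the ASCII terms used here). The class and method names are hypothetical, and the upper bound for the digit loop is taken as the character right after '9' rather than the 'a' used loosely in the reply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

/** Toy model of "seek, then filter a bounded range" over a sorted term dictionary. */
public class RangeFilterSketch {

    /**
     * Visits only the terms in [lower, upperExclusive) and keeps those the
     * pattern accepts, in dictionary order.
     */
    public static List<String> scan(NavigableSet<String> terms, String lower,
                                    String upperExclusive, Pattern pattern) {
        List<String> accepted = new ArrayList<>();
        // subSet returns a view, so iteration starts at 'lower' and stops at the bound.
        for (String term : terms.subSet(lower, true, upperExclusive, false)) {
            if (pattern.matcher(term).matches()) {
                accepted.add(term);
            }
        }
        return accepted;
    }

    /** Exclusive upper bound for every term of the form prefix + [0-9]*. */
    public static String upperBoundForDigits(String prefix) {
        return prefix + (char) ('9' + 1); // ':' sorts directly after '9'
    }

    /** Exclusive upper bound for every term starting with prefix, e.g. 123 -> 124. */
    public static String upperBoundForAnyChar(String prefix) {
        // Assumes a non-empty prefix whose last char is not the maximum char value.
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(List.of(
                "12299", "123", "12345", "12399", "123m", "12400", "40100"));

        // (XXX)[0-9]*: scan the range [123, 123:) and check each term.
        System.out.println(scan(terms, "123", upperBoundForDigits("123"),
                Pattern.compile("123[0-9]*")));

        // (XXX).*: scan the wider range [123, 124), which also reaches 123m.
        System.out.println(scan(terms, "123", upperBoundForAnyChar("123"),
                Pattern.compile("123.*")));
    }
}
```

On a dictionary that really contains only five-digit terms, both scans visit the same range of terms, which is consistent with the advice to benchmark against the actual content.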