lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <luc...@mikemccandless.com>
Subject Re: CompiledAutomaton performance issue
Date Sun, 17 Dec 2017 13:18:28 GMT
This is just an optimization; maybe we should expose an option to disable
it?

Or maybe we can find the common suffix on an NFA instead, to avoid
determinization?

Can you open a Jira issue so we can discuss options?

Thanks,

Mike McCandless

http://blog.mikemccandless.com

On Fri, Dec 15, 2017 at 5:38 AM, Poppe, Thomas (IP&Science) <
thomas.poppe@clarivate.com> wrote:

> Hello,
>
> We're using the automaton package as part of Elasticsearch for doing
> regexp queries.  Our business requires us to process rather complex regular
> expressions, for example (we have more complex examples, but this one
> illustrates the problem):
>
>         (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d
>
> With a large enough value of maxDeterminizedStates, this works.  The
> problem we're having is that the conversion of this regular expression to a
> CompiledAutomaton takes very long.  Almost all of the time goes into
> determining the common suffix for the Automaton (which is "d" in this
> example) - calculated with a call to Operations.getCommonSuffixBytesRef.
>
> If my understanding is correct, this suffix is only used as an
> optimization (is this correct?).  Skipping the calculation of this suffix
> allows us to process these kinds of queries.
>
> So here are my questions:
> - Would it be possible to introduce a way to skip the calculation of this
> common suffix (ideally something we control from within our query to
> Elasticsearch)?
> - Or would it be possible to take a look at this getCommonSuffixBytesRef
> operation, to see if it can be optimized?  Most of the time goes to
> determinizing the reversed automaton - maybe this can be avoided somehow?
> - Does anyone have any other suggestions?  We've tried to reduce the
> complexity of the query, but it is something we would really like to
> support.
>
> Thanks,
> Thomas Poppe
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message