lucene-java-user mailing list archives

From "Poppe, Thomas (IP&Science)" <thomas.po...@Clarivate.com>
Subject CompiledAutomaton performance issue
Date Fri, 15 Dec 2017 10:38:41 GMT
Hello,

We're using the automaton package as part of Elasticsearch for doing regexp queries.  Our
business requires us to process rather complex regular expressions, for example (we have more
complex examples, but this one illustrates the problem):

	(¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d

With a large enough value of maxDeterminizedStates, this works.  The problem we're having
is that converting this regular expression to a CompiledAutomaton takes a very long time.
Almost all of that time is spent determining the common suffix of the Automaton (which is
"d" in this example), calculated with a call to Operations.getCommonSuffixBytesRef.

As far as I understand, this suffix is used only as an optimization (is this correct?).
Skipping the calculation of this suffix allows us to process these kinds of queries.

So here are my questions:
- Would it be possible to introduce a way to skip the calculation of this common suffix (ideally
something we control from within our query to Elasticsearch)?
- Or would it be possible to take a look at this getCommonSuffixBytesRef operation, to see
if it can be optimized?  Most of the time is spent determinizing the reversed automaton - maybe
this can be avoided somehow?
- Does anyone have any other suggestions?  We've tried to reduce the complexity of the query,
but it is something we would really like to support.
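
To illustrate why determinizing a reversed automaton can be so much more expensive than
determinizing the original, here is a minimal, self-contained sketch.  It does not use Lucene
and is not how Lucene implements anything; the class and method names are hypothetical, and
the toy language (a|b){k-1}a(a|b)* is only an assumption chosen because its blow-up is easy
to see.  Our real regular expression may behave differently in detail.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy example, not Lucene code.
public class ReverseDeterminizeDemo {

    // Counts the DFA states produced by subset construction over a
    // two-symbol alphabet (0 = 'a', 1 = 'b').
    // trans[s][c] is a bitmask of the NFA states reachable from s on symbol c.
    static int countDfaStates(long[][] trans, long startSet) {
        Set<Long> seen = new HashSet<>();
        Deque<Long> work = new ArrayDeque<>();
        seen.add(startSet);
        work.add(startSet);
        while (!work.isEmpty()) {
            long cur = work.poll();
            for (int c = 0; c < 2; c++) {
                long next = 0L;
                for (int s = 0; s < trans.length; s++) {
                    if ((cur & (1L << s)) != 0) {
                        next |= trans[s][c];
                    }
                }
                if (seen.add(next)) {
                    work.add(next);
                }
            }
        }
        return seen.size();
    }

    // Automaton for (a|b){k-1} a (a|b)*, i.e. "the k-th symbol is 'a'".
    // States 0..k, start state 0, accepting state k. Requires 1 <= k <= 62.
    static long[][] forward(int k) {
        long[][] t = new long[k + 1][2];
        for (int s = 0; s < k - 1; s++) {
            t[s][0] = t[s][1] = 1L << (s + 1);
        }
        t[k - 1][0] = 1L << k;          // the k-th symbol must be 'a'
        t[k][0] = t[k][1] = 1L << k;    // anything may follow
        return t;
    }

    // Reverses every transition; the old accepting state becomes the start.
    static long[][] reverse(long[][] t) {
        long[][] r = new long[t.length][2];
        for (int s = 0; s < t.length; s++) {
            for (int c = 0; c < 2; c++) {
                for (int d = 0; d < t.length; d++) {
                    if ((t[s][c] & (1L << d)) != 0) {
                        r[d][c] |= 1L << s;
                    }
                }
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int k = 10;
        long[][] fwd = forward(k);
        System.out.println("forward:  " + countDfaStates(fwd, 1L));
        System.out.println("reversed: " + countDfaStates(reverse(fwd), 1L << k));
    }
}
```

For k = 10 the forward automaton determinizes to 12 states, while the reversed one needs
1024 (2^k), because it has to remember which of the last k symbols were 'a'.  An effect of
this kind would be consistent with the forward automaton fitting within maxDeterminizedStates
while the reversed one is very slow to determinize.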

Thanks,
Thomas Poppe
