drill-dev mailing list archives

From Daniel Barclay <dbarc...@maprtech.com>
Subject Re: question about UDF optimization
Date Tue, 21 Jul 2015 23:25:10 GMT

Should Drill be defaulting the other way?

That is, instead of assuming a function is pure unless declared otherwise
(which yields wrong results when that assumption is wrong, or when the
annotation was forgotten), should Drill assume a function is not pure
unless it is declared pure (which costs only some performance in the
wrong-assumption case)?

Daniel



Jacques Nadeau wrote:
> There is an annotation on the function template.  I don't have a laptop
> close by, but I believe it is something similar to isRandom. It basically
> tells Drill that this is a nondeterministic function. I will be more
> specific once I get back to my machine if you don't find it sooner.
>
> Jacques
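
For reference, the annotation attribute Jacques is recalling is isRandom on
@FunctionTemplate. A minimal sketch of a nondeterministic uniform-random
UDF with that flag set (class and field names here are illustrative, not
the actual code from the gist below) might look like:

import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;

@FunctionTemplate(
    name = "random",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL,
    isRandom = true)    // declares the function nondeterministic
public class RandomUniform implements DrillSimpleFunc {

  @Param  Float8Holder min;   // lower bound of the uniform distribution
  @Param  Float8Holder max;   // upper bound
  @Output Float8Holder out;

  public void setup() { }

  public void eval() {
    // Drill inlines this body into generated code, so anything beyond the
    // holder fields should be referenced by its fully qualified name.
    out.value = min.value + (max.value - min.value) * java.lang.Math.random();
  }
}

With isRandom = true, the planner should treat each call as
nondeterministic and skip the constant folding described below, so the
function should run once per row even when both arguments are literals.
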
> *Summary:*
>
> Drill is very aggressive about optimizing away calls to functions with
> constant arguments. I worry that this could extend to per-record-batch
> optimization if I accidentally have constant values, and even if that
> doesn't happen, it is a pain in the ass now, largely because Drill is
> clever enough to see through my attempts to hide the constant nature of my
> parameters.
>
> *Question:*
>
> Is there a way to mark a UDF as not being a pure function?
>
> *Details:*
>
> I have written a UDF to generate a random number.  It takes parameters that
> define the distribution.  All seems well and good.
>
> I find, however, that the function is only called once (twice, actually,
> apparently due to pipeline warmup) and then Drill optimizes away later
> calls, apparently because the parameters to the function are constant and
> Drill thinks my function is a pure function.  If I make up some bogus data
> to pass in as a parameter, all is well and the function is called as often
> as I want.
>
> For instance, with the uniform distribution, my function takes two
> arguments, those being the minimum and maximum value to return.  Here is
> what I see with constants for the min and max:
>
> 0: jdbc:drill:zk=local> select random(0,10) from (values 5,5,5,5) as tbl(x);
> into eval
> into eval
> +---------------------+
> |       EXPR$0        |
> +---------------------+
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> | 1.7787372583008298  |
> +---------------------+
>
>
> If I include an actual value, we see more interesting behavior even if the
> value is effectively constant:
>
> 0: jdbc:drill:zk=local> select random(0,x) from (values 5,5,5,5) as tbl(x);
> into eval
> into eval
> into eval
> into eval
> +----------------------+
> |        EXPR$0        |
> +----------------------+
> | 3.688377805419459    |
> | 0.2827056410711032   |
> | 2.3107479622644918   |
> | 0.10813788169218574  |
> +----------------------+
> 4 rows selected (0.088 seconds)
>
>
> Even if I make the max value come along from the sub-query, I get the evil
> behavior, although the function is now, surprisingly, actually called three
> times, apparently due to warming up the pipeline:
>
> 0: jdbc:drill:zk=local> select random(0,max_value) from (select 14 as
> max_value,x from (values 5,5,5,5) as tbl(x)) foo;
> into eval
> into eval
> into eval
> +---------------------+
> |       EXPR$0        |
> +---------------------+
> | 13.404462063773702  |
> | 13.404462063773702  |
> | 13.404462063773702  |
> | 13.404462063773702  |
> +---------------------+
> 4 rows selected (0.121 seconds)
>
> The UDF itself is boring and can be found at
> https://gist.github.com/tdunning/0c2cc2089e6cd8c030c0
>
> So how can I defeat this behavior?
>


-- 
Daniel Barclay
MapR Technologies
