arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jorge Cardoso Leitão <jorgecarlei...@gmail.com>
Subject Re: Polymorphism in DataFusion
Date Mon, 17 Aug 2020 20:51:55 GMT
Thanks Andrew,

I am not sure I articulated this well enough, though, as I did not specify
the type of polymorphism that I was thinking about. xD

My question was/is about whether we should accept functions whose return
type is known during planning, and constant during execution, or whether
their return types must be constant both during planning and execution. I
do not think we should support variable types during execution for the
reasons that you enumerated. If by runtime polymorphism you mean changing
types during execution, then I very much agree with you that that is a
no-no.

During planning, though, we have options: should we allow users to write
something like `my_operation(f32|f64) -> (f32|f64)`, on which the type is
inferred after we know the function's input in the logical plan, or should
we not allow that and require users to register `my_operation_f32(f32)` and
`my_operation_f64(f64)` separately? The three findings that I mentioned
above refer to planned polymorphism: return type is resolved during
planning (and constant during execution).

The biggest use-case IMO for polymorphism during planning is for functions
that yield structures/lists of values (a-la collect_list) whose type can
only be inferred after we know the functions' input type (array(f32) vs
array(f64)), and whose implementation can be generalized via a macro +
match.

>From a technical point of view, we currently have functions with variable
types (all binary operators' return type depends on the lhs' type, sum,
max/min, etc.), and we have to handle the main planning challenges already.
In this context, the questions are something like:

1. should we continue to have them or should we move away from them?
2.1 If not, what should we do with them? E.g. declare sum_i32, sum_i64,
etc., that have a single return type?
2.2 if yes, show we allow users to register these types of functions, or
should these only be allowed within DataFusion's code base?

Best,
Jorge



On Mon, Aug 17, 2020 at 9:53 PM Andrew Lamb <alamb@influxdata.com> wrote:

> In my opinion, I suggest we do not continue down the path of (runtime)
> polymorphic functions unless a compelling use case for them can be
> articulated.
>
> You have done a great job articulating some of the implementation
> challenges, but I personally struggle to describe when, as a user of
> DataFusion, I would want to write a (runtime) polymorphic function.
>
> A function with runtime polymorphism I think would mean the UDF could
> handle the type changing *at runtime*: record batches could come in with
> multiple different types during the same execution. I can't think of
> examples where this behavior would be desirable or necessary.
>
> The existing DataFusion codebase seems to assume (reasonably in my opinion)
> that the schema of each Logical / Physical plan node is known at planning
> time and it does not change at runtime.
>
> Most query optimizers (and compilers for that matter) take advantage of
> plan (compile) time type information to make runtime more efficient.  Also,
> it seems like other database / runtime systems such as mysql[1] and
> postgres[2] require the UDF creator to explicitly specify the return type
> as well. I think we should consider the simpler semantics of "1 return type
> for each UDF" to make it easier on people writing UDFs as well as
> simplifying the implementation of DataFusion itself.
>
> Andrew
>
> [1] https://dev.mysql.com/doc/refman/8.0/en/create-function-udf.html
> [2] https://www.postgresql.org/docs/12/sql-createfunction.html
>
> On Mon, Aug 17, 2020 at 12:31 PM Jorge Cardoso Leitão <
> jorgecarleitao@gmail.com> wrote:
>
> > Hi,
> >
> > Recently, I have been contributing to DataFusion, and I would like to
> bring
> > to your attention a question that I faced while PRing to DataFusion that
> > IMO needs some alignment :)
> >
> > DataFusion supports scalar UDFs: functions that expect a type, return a
> > type, and performs some operation on the data (a-la spark UDF). However,
> > the execution engine is actually dynamically typed:
> >
> > * a scalar UDF receives an &[ArrayRef] that must be downcasted
> accordingly
> > * a scalar UDF must select the builder that matches its signature, so
> that
> > its return type matches the ArrayRef that it returns.
> >
> > This suggests that we can treat functions as polymorphic: as long as the
> > function handles the different types (e.g. via match), we are good. We
> > currently do not support multiple input types nor variable return types
> in
> > their function signatures.
> >
> > Our current (non-udf) scalar and aggregate functions are already
> > polymorphic on both their input and return type: sum(i32) -> i64,
> sum(f64)
> > -> f64, "a + b". I have been working on PRs to support polymorphic
> support
> > to scalar UDFs (e.g. sqrt() can take float32 and float64) [1,3], as well
> as
> > polymorphic aggregate UDFs [2], so that we can extend our offering to
> more
> > interesting functions such as "length(t) -> uint", "array(c1, c2)",
> > "collect_list(t) -> array(t)", etc.
> >
> > However, while working on [1,2,3], I reach some non-trivial findings
> that I
> > would like to share:
> >
> > Finding 1: to support polymorphic functions, our logical and physical
> > expressions (Expr and PhysicalExpr) need to be polymorphic as-well: once
> a
> > function is polymorphic, any expression containing it is also
> polymorphic.
> >
> > Finding 2: when a polymorphic expression passes through our type coercer
> > optimizer (that tries to coerce types to match a function's signature),
> it
> > may be re-casted to a different type. If the return type changes, the
> > optimizer may need to re-cast operations dependent of the function call
> > (e.g. a projection followed by an aggregation may need a recast on the
> > projection and on the aggregation).
> >
> > Finding 3: when an expression passes through our type coercer optimizer
> and
> > is re-casted, its name changes (typically from "expr" to "CAST(expr as
> > X)"). This implies that a column referenced as #expr down the plan may
> not
> > exist depending on the input type of the initial projection/scan.
> >
> > Finding 1 and 2 IMO are a direct consequence of polymorphism and the only
> > way to not handle them is by not supporting polymorphism (e.g. the user
> > registers sqrt_f32 and sqrt_f64, etc).
> >
> > Finding 3 can be addressed in at least three ways:
> >
> > A) make the optimizer rewrite the expression as "CAST(expr as X) AS
> expr",
> > so that it retains its original name. This hides the actual expression's
> > calculation, but preserves its original name.
> > B) accept that expressions can always change its name, which means that
> the
> > user should be mindful when writing `col("SELECT sqrt(x) FROM t"`, as the
> > column name may end up being called `"sqrt(CAST(x as X))"`.
> > C) Do not support polymorphic functions
> >
> > Note that we currently already experience effects 1-3, it is just that we
> > use so few polymorphic functions that these seldomly present themselves.
> It
> > was while working on [1,2,3] that I start painting the bigger picture.
> >
> > Some questions:
> > 1. should continue down the path of polymorphic functions?
> > 2. if yes, how do handle finding 3?
> >
> > Looking at the current code base, I am confident that we can address the
> > technical issues to support polymorphic functions. However, it would be
> > interesting to have your thoughts on this.
> >
> > [1] https://github.com/apache/arrow/pull/7967
> > [2] https://github.com/apache/arrow/pull/7971
> > [3] https://github.com/apache/arrow/pull/7974
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message