arrow-dev mailing list archives

From Andrew Lamb <al...@influxdata.com>
Subject Re: Polymorphism in DataFusion
Date Tue, 18 Aug 2020 12:04:18 GMT
It is my personal opinion that UDFs registered with DataFusion should take
a known set of input types and a single return type (e.g.
sum_i32 --> i32). I think this would:
1. Simplify the implementation of both the DataFusion optimizer and the UDFs
2. Make it easier for UDF writers as the UDF code would look more like
Rust: the types would be clear from the function signatures, as is the case
in Rust in general
3. Give the user of SQL / DataFrames the ability to specify exactly
what types they want

If we wanted the ability for the user to specify `sum(i)` and let the type
coercion pass pick `sum_i32` or `sum_i64` depending on the input types, I
recommend doing that at a different level than the UDF (perhaps via
`register_alias("sum", "sum_i32")` or something), again for both clarity of
DataFusion implementation as well as UDF specification.
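
To make the alias idea concrete, here is a rough sketch in plain Rust. All
names in it (`DataType`, `ScalarUdf`, `Registry`, `register_udf`,
`register_alias`, `resolve`) are hypothetical stand-ins for illustration, not
DataFusion's actual API. Each UDF has exactly one signature, and the generic
name is resolved to a concrete UDF once, at planning time:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for Arrow's data types (not the Arrow API).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum DataType {
    Int32,
    Int64,
    Float64,
}

/// A UDF with exactly one input signature and one return type.
#[derive(Clone, Debug, PartialEq)]
pub struct ScalarUdf {
    pub name: &'static str,
    pub args: Vec<DataType>,
    pub return_type: DataType,
}

/// Hypothetical registry: concrete UDFs plus generic aliases.
#[derive(Default)]
pub struct Registry {
    udfs: HashMap<&'static str, ScalarUdf>,
    /// alias -> candidate concrete UDF names, in registration order
    aliases: HashMap<&'static str, Vec<&'static str>>,
}

impl Registry {
    pub fn register_udf(&mut self, udf: ScalarUdf) {
        self.udfs.insert(udf.name, udf);
    }

    /// Adds one concrete UDF as a candidate for a generic name.
    pub fn register_alias(&mut self, alias: &'static str, target: &'static str) {
        self.aliases.entry(alias).or_default().push(target);
    }

    /// Planning-time lookup: a concrete name must match its signature
    /// exactly; an alias picks the first candidate whose signature matches.
    pub fn resolve(&self, name: &str, args: &[DataType]) -> Option<&ScalarUdf> {
        if let Some(udf) = self.udfs.get(name) {
            return (udf.args == args).then_some(udf);
        }
        self.aliases
            .get(name)?
            .iter()
            .filter_map(|target| self.udfs.get(target))
            .find(|udf| udf.args == args)
    }
}
```

With `sum_i32` and `sum_i64` registered and both aliased to "sum", a call
such as `resolve("sum", &[DataType::Int64])` would pick `sum_i64`, while the
UDFs themselves stay monomorphic.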

Andrew

On Mon, Aug 17, 2020 at 4:52 PM Jorge Cardoso Leitão <
jorgecarleitao@gmail.com> wrote:

> Thanks Andrew,
>
> I am not sure I articulated this well enough, though, as I did not specify
> the type of polymorphism that I was thinking about. xD
>
> My question was/is about whether we should accept functions whose return
> type is known during planning, and constant during execution, or whether
> their return types must be constant both during planning and execution. I
> do not think we should support variable types during execution for the
> reasons that you enumerated. If by runtime polymorphism you mean changing
> types during execution, then I very much agree with you that that is a
> no-no.
>
> During planning, though, we have options: should we allow users to write
> something like `my_operation(f32|f64) -> (f32|f64)`, on which the type is
> inferred after we know the function's input in the logical plan, or should
> we not allow that and require users to register `my_operation_f32(f32)` and
> `my_operation_f64(f64)` separately? The three findings that I mentioned
> above refer to planned polymorphism: return type is resolved during
> planning (and constant during execution).
>
> The biggest use-case IMO for polymorphism during planning is for functions
> that yield structures/lists of values (a-la collect_list) whose type can
> only be inferred after we know the functions' input type (array(f32) vs
> array(f64)), and whose implementation can be generalized via a macro +
> match.
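
As a rough illustration of "return type resolved during planning", the sketch
below models the return type as a function of the argument types that is
evaluated once by the planner. The `DataType` enum, `ReturnTypeFn`,
`PlannedCall` and `plan_call` are hypothetical names made up for this
example; they are not DataFusion's API.

```rust
/// Hypothetical stand-in for Arrow's data types (not the Arrow API).
#[derive(Clone, Debug, PartialEq)]
pub enum DataType {
    Float32,
    Float64,
    Utf8,
    List(Box<DataType>),
}

/// Return type as a function of the argument types. It is called during
/// planning only, so the type is constant during execution.
pub type ReturnTypeFn = fn(&[DataType]) -> Result<DataType, String>;

/// `my_operation(f32|f64) -> (f32|f64)`: the output type follows the input.
pub fn my_operation_return_type(args: &[DataType]) -> Result<DataType, String> {
    match args {
        [DataType::Float32] => Ok(DataType::Float32),
        [DataType::Float64] => Ok(DataType::Float64),
        other => Err(format!("my_operation does not accept {:?}", other)),
    }
}

/// `collect_list(t) -> array(t)`: the element type is known only once the
/// input type is known.
pub fn collect_list_return_type(args: &[DataType]) -> Result<DataType, String> {
    match args {
        [t] => Ok(DataType::List(Box::new(t.clone()))),
        other => Err(format!(
            "collect_list expects one argument, got {}",
            other.len()
        )),
    }
}

/// A function call whose return type has been fixed by the planner.
#[derive(Debug, PartialEq)]
pub struct PlannedCall {
    pub name: &'static str,
    pub return_type: DataType,
}

/// Resolves the return type once, while building the plan.
pub fn plan_call(
    name: &'static str,
    return_type: ReturnTypeFn,
    args: &[DataType],
) -> Result<PlannedCall, String> {
    Ok(PlannedCall {
        name,
        return_type: return_type(args)?,
    })
}
```

Planning `collect_list` over a Float32 column would yield a `PlannedCall`
whose `return_type` is `List(Float32)`; an unsupported input fails at
planning time rather than during execution.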
>
> From a technical point of view, we currently have functions with variable
> types (all binary operators' return type depends on the lhs' type, sum,
> max/min, etc.), and we have to handle the main planning challenges already.
> In this context, the questions are something like:
>
> 1. should we continue to have them or should we move away from them?
> 2.1 If not, what should we do with them? E.g. declare sum_i32, sum_i64,
> etc., that have a single return type?
> 2.2 If yes, should we allow users to register these types of functions, or
> should these only be allowed within DataFusion's code base?
>
> Best,
> Jorge
>
>
>
> On Mon, Aug 17, 2020 at 9:53 PM Andrew Lamb <alamb@influxdata.com> wrote:
>
> > In my opinion, I suggest we do not continue down the path of (runtime)
> > polymorphic functions unless a compelling use case for them can be
> > articulated.
> >
> > You have done a great job articulating some of the implementation
> > challenges, but I personally struggle to describe when, as a user of
> > DataFusion, I would want to write a (runtime) polymorphic function.
> >
> > A function with runtime polymorphism I think would mean the UDF could
> > handle the type changing *at runtime*: record batches could come in with
> > multiple different types during the same execution. I can't think of
> > examples where this behavior would be desirable or necessary.
> >
> > The existing DataFusion codebase seems to assume (reasonably in my
> > opinion) that the schema of each Logical / Physical plan node is known at
> > planning time and it does not change at runtime.
> >
> > Most query optimizers (and compilers for that matter) take advantage of
> > plan (compile) time type information to make runtime more efficient.
> > Also, it seems like other database / runtime systems such as mysql[1] and
> > postgres[2] require the UDF creator to explicitly specify the return type
> > as well. I think we should consider the simpler semantics of "1 return
> > type for each UDF" to make it easier on people writing UDFs as well as
> > simplifying the implementation of DataFusion itself.
> >
> > Andrew
> >
> > [1] https://dev.mysql.com/doc/refman/8.0/en/create-function-udf.html
> > [2] https://www.postgresql.org/docs/12/sql-createfunction.html
> >
> > On Mon, Aug 17, 2020 at 12:31 PM Jorge Cardoso Leitão <
> > jorgecarleitao@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Recently, I have been contributing to DataFusion, and I would like to
> > > bring to your attention a question that I faced while PRing to
> > > DataFusion that IMO needs some alignment :)
> > >
> > > DataFusion supports scalar UDFs: functions that expect a type, return a
> > > type, and perform some operation on the data (a-la spark UDF). However,
> > > the execution engine is actually dynamically typed:
> > >
> > > * a scalar UDF receives an &[ArrayRef] that must be downcast accordingly
> > > * a scalar UDF must select the builder that matches its signature, so
> > > that its return type matches the ArrayRef that it returns.
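
The dynamic typing described in the two bullets above can be sketched in
plain Rust. The `Array` trait, `ArrayRef`, `Float32Array` and `Float64Array`
below are simplified, hypothetical stand-ins modeled loosely on Arrow's
types, not the real Arrow API. The point is only that a UDF body receives
type-erased columns and has to downcast them itself:

```rust
use std::any::Any;
use std::sync::Arc;

/// Simplified stand-in for a type-erased column (not Arrow's `Array`).
pub trait Array: Any {
    fn as_any(&self) -> &dyn Any;
}

pub type ArrayRef = Arc<dyn Array>;

pub struct Float32Array(pub Vec<f32>);
pub struct Float64Array(pub Vec<f64>);

impl Array for Float32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl Array for Float64Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// A `sqrt` that accepts either float type. The concrete type is only
/// discovered by downcasting, and the output array must be built with the
/// matching type by hand.
pub fn sqrt(args: &[ArrayRef]) -> Result<ArrayRef, String> {
    let arg = args.first().ok_or("sqrt expects one argument")?;
    if let Some(a) = arg.as_any().downcast_ref::<Float32Array>() {
        let out: ArrayRef = Arc::new(Float32Array(a.0.iter().map(|v| v.sqrt()).collect()));
        return Ok(out);
    }
    if let Some(a) = arg.as_any().downcast_ref::<Float64Array>() {
        let out: ArrayRef = Arc::new(Float64Array(a.0.iter().map(|v| v.sqrt()).collect()));
        return Ok(out);
    }
    Err("sqrt supports only Float32 and Float64 input".to_string())
}
```

Nothing in the signature of `sqrt` tells the planner which types it accepts
or returns; that information has to be declared separately.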
> > >
> > > This suggests that we can treat functions as polymorphic: as long as
> > > the function handles the different types (e.g. via match), we are good.
> > > We currently do not support multiple input types nor variable return
> > > types in UDF signatures.
> > >
> > > Our current (non-udf) scalar and aggregate functions are already
> > > polymorphic on both their input and return type: sum(i32) -> i64,
> > > sum(f64) -> f64, "a + b". I have been working on PRs to add polymorphic
> > > support to scalar UDFs (e.g. sqrt() can take float32 and float64) [1,3],
> > > as well as polymorphic aggregate UDFs [2], so that we can extend our
> > > offering to more interesting functions such as "length(t) -> uint",
> > > "array(c1, c2)", "collect_list(t) -> array(t)", etc.
> > >
> > > However, while working on [1,2,3], I reached some non-trivial findings
> > > that I would like to share:
> > >
> > > Finding 1: to support polymorphic functions, our logical and physical
> > > expressions (Expr and PhysicalExpr) need to be polymorphic as well:
> > > once a function is polymorphic, any expression containing it is also
> > > polymorphic.
> > >
> > > Finding 2: when a polymorphic expression passes through our type
> > > coercer optimizer (that tries to coerce types to match a function's
> > > signature), it may be recast to a different type. If the return type
> > > changes, the optimizer may need to recast operations dependent on the
> > > function call (e.g. a projection followed by an aggregation may need a
> > > recast on the projection and on the aggregation).
> > >
> > > Finding 3: when an expression passes through our type coercer
> > > optimizer and is recast, its name changes (typically from "expr" to
> > > "CAST(expr as X)"). This implies that a column referenced as #expr down
> > > the plan may not exist depending on the input type of the initial
> > > projection/scan.
> > >
> > > Findings 1 and 2 IMO are a direct consequence of polymorphism and the
> > > only way to not handle them is by not supporting polymorphism (e.g. the
> > > user registers sqrt_f32 and sqrt_f64, etc).
> > >
> > > Finding 3 can be addressed in at least three ways:
> > >
> > > A) make the optimizer rewrite the expression as "CAST(expr as X) AS
> > > expr", so that it retains its original name. This hides the actual
> > > expression's calculation, but preserves its original name.
> > > B) accept that expressions can always change their names, which means
> > > that the user should be mindful when writing
> > > `col("SELECT sqrt(x) FROM t")`, as the column name may end up being
> > > called `"sqrt(CAST(x as X))"`.
> > > C) Do not support polymorphic functions
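
The difference between options A and B can be shown with a toy expression
tree. The `Expr` enum and the `recast` / `recast_keeping_name` helpers below
are hypothetical and much simpler than DataFusion's real `Expr`; this is a
sketch of the naming behaviour only, not of the actual optimizer:

```rust
/// Toy logical expression (hypothetical; far simpler than DataFusion's).
#[derive(Clone, Debug, PartialEq)]
pub enum Expr {
    Column(String),
    Cast { expr: Box<Expr>, to: &'static str },
    Alias { expr: Box<Expr>, name: String },
}

impl Expr {
    /// Output column name, as later plan nodes would reference it.
    pub fn name(&self) -> String {
        match self {
            Expr::Column(c) => c.clone(),
            Expr::Cast { expr, to } => format!("CAST({} as {})", expr.name(), to),
            Expr::Alias { name, .. } => name.clone(),
        }
    }
}

/// Option B: a plain recast, so the output name changes.
pub fn recast(expr: Expr, to: &'static str) -> Expr {
    Expr::Cast {
        expr: Box::new(expr),
        to,
    }
}

/// Option A: recast, then alias the result back to the original name.
pub fn recast_keeping_name(expr: Expr, to: &'static str) -> Expr {
    let name = expr.name();
    Expr::Alias {
        expr: Box::new(recast(expr, to)),
        name,
    }
}
```

Under option B, recasting column "x" to Float64 produces an expression named
"CAST(x as Float64)", so a later reference to "x" no longer resolves. Under
option A the output is still named "x", at the cost of hiding the cast from
anyone who only looks at the column name.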
> > >
> > > Note that we currently already experience effects 1-3, it is just that
> > > we use so few polymorphic functions that these seldom present
> > > themselves. It was while working on [1,2,3] that I started painting the
> > > bigger picture.
> > >
> > > Some questions:
> > > 1. should we continue down the path of polymorphic functions?
> > > 2. if yes, how do we handle finding 3?
> > >
> > > Looking at the current code base, I am confident that we can address
> > > the technical issues to support polymorphic functions. However, it would
> > > be interesting to have your thoughts on this.
> > >
> > > [1] https://github.com/apache/arrow/pull/7967
> > > [2] https://github.com/apache/arrow/pull/7971
> > > [3] https://github.com/apache/arrow/pull/7974
> > >
> >
>
