spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
Date Fri, 01 Sep 2017 09:52:00 GMT
Is the idea that aggregates are out of scope for the current effort and
that we will be adding them later?

On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ueshin@happy-camper.st> wrote:

> Hi all,
>
> We've been discussing support for vectorized UDFs in Python and have
> almost reached consensus on the APIs, so I'd like to summarize and call
> for a vote.
>
> Note that this vote should focus on APIs for vectorized UDFs, not APIs for
> vectorized UDAFs or Window operations.
>
> https://issues.apache.org/jira/browse/SPARK-21190
>
>
> *Proposed API*
>
> We introduce a @pandas_udf decorator (or annotation) to define vectorized
> UDFs. A vectorized UDF takes one or more pandas.Series as input; a
> 0-parameter UDF instead takes a single integer giving the length of the
> input. The return value should be a pandas.Series of the specified type,
> and its length should be the same as that of the input.
>
> We can define vectorized UDFs as:
>
>   @pandas_udf(DoubleType())
>   def plus(v1, v2):
>       return v1 + v2
>
> or, equivalently, as:
>
>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>
> We can use it in the same way as row-by-row UDFs:
>
>   df.withColumn('sum', plus(df.v1, df.v2))
>
> As for 0-parameter UDFs, we can define and use them as:
>
>   @pandas_udf(LongType())
>   def f0(size):
>       return pd.Series(1).repeat(size)
>
>   df.select(f0())
>
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>
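
For readers of the archive, the contract described in the proposal (pandas.Series in, a pandas.Series of the same length out, and a bare integer length for 0-parameter UDFs) can be tried out with pandas alone. The sketch below is only an illustration under stated assumptions: apply_vectorized, its batch_size and num_rows parameters, and the batching strategy are hypothetical stand-ins for whatever the engine does, not part of the proposed API or of Spark.

```python
import pandas as pd


def apply_vectorized(func, *columns, batch_size=2, num_rows=None):
    """Hypothetical stand-in for the engine (not a Spark API).

    Calls func once per batch and checks that each result has the same
    length as the batch it was computed from.
    """
    if columns:
        total = len(columns[0])
    elif num_rows is not None:
        total = num_rows
    else:
        raise ValueError("num_rows is required for a 0-parameter UDF")

    pieces = []
    for start in range(0, total, batch_size):
        stop = min(start + batch_size, total)
        if columns:
            # One pandas.Series slice per input column.
            args = [col.iloc[start:stop] for col in columns]
        else:
            # A 0-parameter UDF receives only the batch length.
            args = [stop - start]
        out = pd.Series(func(*args))
        if len(out) != stop - start:
            raise ValueError("result length must match batch length")
        pieces.append(out)

    if not pieces:
        return pd.Series(dtype="float64")
    return pd.concat(pieces, ignore_index=True)


# The two UDF bodies from the proposal, without the decorator.
def plus(v1, v2):
    return v1 + v2


def f0(size):
    return pd.Series(1).repeat(size)


v1 = pd.Series([1.0, 2.0, 3.0])
v2 = pd.Series([10.0, 20.0, 30.0])
sums = apply_vectorized(plus, v1, v2)      # two batches: rows 0-1, then row 2
ones = apply_vectorized(f0, num_rows=3)    # f0 sees sizes 2 and 1
```

A function that changes the number of rows, such as one returning only the first element of its input, is rejected by the length check, which is the constraint that distinguishes these scalar-style UDFs from aggregates.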
