spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
Date Tue, 12 Sep 2017 02:46:27 GMT
+1


Xiao Li wrote
> +1
> 
> Xiao
> On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia &lt;

> matei.zaharia@

> &gt;
> wrote:
> 
>> +1 (binding)
>>
>> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon &lt;

> gurwls223@

> &gt; wrote:
>> >
>> > +1 (non-binding)
>> >
>> >
>> > 2017-09-12 9:52 GMT+09:00 Yin Huai &lt;

> yhuai@

> &gt;:
>> > +1
>> >
>> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal &lt;

> sameer@

> &gt;
>> wrote:
>> > +1 (non-binding)
>> >
>> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler &lt;

> cutlerb@

> &gt; wrote:
>> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think
>> it's
>> fine to work out the minor details of the API during review.
>> >
>> > Bryan
>> >
>> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Hi all,
>> >
>> > Thank you for voting and suggestions.
>> >
>> > As Wenchen mentioned and also we're discussing at JIRA, we need to
>> discuss the size hint for the 0-parameter UDF.
>> > But I believe we got a consensus about the basic APIs except for the
>> size hint, I'd like to submit a pr based on the current proposal and
>> continue discussing in its review.
>> >
>> >     https://github.com/apache/spark/pull/19147
>> >
>> > I'd keep this vote open to wait for more opinions.
>> >
>> > Thanks.
>> >
>> >
>> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan &lt;

> cloud0fan@

> &gt; wrote:
>> > +1 on the design and proposed API.
>> >
>> > One detail I'd like to discuss is the 0-parameter UDF, how we can
>> specify the size hint. This can be done in the PR review though.
>> >
>> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung &lt;

> felixcheung_m@

> &gt;
>> wrote:
>> > +1 on this and like the suggestion of type in string form.
>> >
>> > Would it be correct to assume there will be data type check, for
>> example
>> the returned pandas data frame column data types match what are
>> specified.
>> We have seen quite a bit of issues/confusions with that in R.
>> >
>> > Would it make sense to have a more generic decorator name so that it
>> could also be useable for other efficient vectorized format in the
>> future?
>> Or do we anticipate the decorator to be format specific and will have
>> more
>> in the future?
>> >
>> > From: Reynold Xin &lt;

> rxin@

> &gt;
>> > Sent: Friday, September 1, 2017 5:16:11 AM
>> > To: Takuya UESHIN
>> > Cc: spark-dev
>> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>> >
>> > Ok, thanks.
>> >
>> > +1 on the SPIP for scope etc
>> >
>> >
>> > On API details (will deal with in code reviews as well but leaving a
>> note here in case I forget)
>> >
>> > 1. I would suggest having the API also accept data type specification
>> in
>> string form. It is usually simpler to say "long" then "LongType()".
>> >
>> > 2. Think about what error message to show when the rows numbers don't
>> match at runtime.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Yes, the aggregation is out of scope for now.
>> > I think we should continue discussing the aggregation at JIRA and we
>> will be adding those later separately.
>> >
>> > Thanks.
>> >
>> >
>> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin &lt;

> rxin@

> &gt; wrote:
>> > Is the idea aggregate is out of scope for the current effort and we
>> will
>> be adding those later?
>> >
>> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN &lt;

> ueshin@

> &gt;
>> wrote:
>> > Hi all,
>> >
>> > We've been discussing to support vectorized UDFs in Python and we
>> almost
>> got a consensus about the APIs, so I'd like to summarize and call for a
>> vote.
>> >
>> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
>> for vectorized UDAFs or Window operations.
>> >
>> > https://issues.apache.org/jira/browse/SPARK-21190
>> >
>> >
>> > Proposed API
>> >
>> > We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs which takes one or more pandas.Series or one integer
>> value
>> meaning the length of the input value for 0-parameter UDFs. The return
>> value should be pandas.Series of the specified type and the length of the
>> returned value should be the same as input value.
>> >
>> > We can define vectorized UDFs as:
>> >
>> >   @pandas_udf(DoubleType())
>> >   def plus(v1, v2):
>> >       return v1 + v2
>> >
>> > or we can define as:
>> >
>> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>> >
>> > We can use it similar to row-by-row UDFs:
>> >
>> >   df.withColumn('sum', plus(df.v1, df.v2))
>> >
>> > As for 0-parameter UDFs, we can define and use as:
>> >
>> >   @pandas_udf(LongType())
>> >   def f0(size):
>> >       return pd.Series(1).repeat(size)
>> >
>> >   df.select(f0())
>> >
>> >
>> >
>> > The vote will be up for the next 72 hours. Please reply with your vote:
>> >
>> > +1: Yeah, let's go forward and implement the SPIP.
>> > +0: Don't really care.
>> > -1: I don't think this is a good idea because of the following
>> technical
>> reasons.
>> >
>> > Thanks!
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Takuya UESHIN
>> > Tokyo, Japan
>> >
>> > http://twitter.com/ueshin
>> >
>> >
>> >
>> >
>> > --
>> > Sameer Agarwal
>> > Software Engineer | Databricks Inc.
>> > http://cs.berkeley.edu/~sameerag
>> >
>> >
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Mime
View raw message