spark-dev mailing list archives

From Takuya UESHIN <ues...@happy-camper.st>
Subject Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
Date Tue, 12 Sep 2017 08:04:40 GMT
This vote passes with 4 binding +1 votes, 6 non-binding +1 votes, no +0 votes,
and no -1 votes.

Thanks all!

+1 votes (binding):
Reynold Xin
Wenchen Fan
Yin Huai
Matei Zaharia


+1 votes (non-binding):
Felix Cheung
Bryan Cutler
Sameer Agarwal
Hyukjin Kwon
Xiao Li
Liang-Chi Hsieh



On Tue, Sep 12, 2017 at 11:46 AM, Liang-Chi Hsieh <viirya@gmail.com> wrote:

> +1
>
>
> Xiao Li wrote
> > +1
> >
> > Xiao
> > On Mon, 11 Sep 2017 at 6:44 PM Matei Zaharia <matei.zaharia@> wrote:
> >
> >> +1 (binding)
> >>
> >> > On Sep 11, 2017, at 5:54 PM, Hyukjin Kwon <gurwls223@> wrote:
> >> >
> >> > +1 (non-binding)
> >> >
> >> >
> >> > 2017-09-12 9:52 GMT+09:00 Yin Huai <yhuai@>:
> >> > +1
> >> >
> >> > On Mon, Sep 11, 2017 at 5:47 PM, Sameer Agarwal <sameer@> wrote:
> >> > +1 (non-binding)
> >> >
> >> > On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutlerb@> wrote:
> >> > +1 (non-binding) for the goals and non-goals of this SPIP.  I think
> >> > it's fine to work out the minor details of the API during review.
> >> >
> >> > Bryan
> >> >
> >> > On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ueshin@> wrote:
> >> > Hi all,
> >> >
> >> > Thank you for voting and suggestions.
> >> >
> >> > As Wenchen mentioned, and as we're also discussing on JIRA, we still
> >> > need to settle the size hint for the 0-parameter UDF.
> >> > But since I believe we have reached consensus on the basic APIs apart
> >> > from the size hint, I'd like to submit a PR based on the current
> >> > proposal and continue the discussion in its review.
> >> >
> >> >     https://github.com/apache/spark/pull/19147
> >> >
> >> > I'd keep this vote open to wait for more opinions.
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0fan@> wrote:
> >> > +1 on the design and proposed API.
> >> >
> >> > One detail I'd like to discuss is how we can specify the size hint
> >> > for the 0-parameter UDF. This can be done in the PR review, though.
> >> >
> >> > On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheung_m@> wrote:
> >> > +1 on this, and I like the suggestion of accepting types in string form.
> >> >
> >> > Would it be correct to assume there will be a data type check, for
> >> > example that the column data types of the returned pandas data frame
> >> > match what is specified? We have seen quite a few issues and a lot of
> >> > confusion with that in R.
> >> >
> >> > Would it make sense to have a more generic decorator name so that it
> >> > could also be usable for other efficient vectorized formats in the
> >> > future? Or do we anticipate the decorator being format-specific, with
> >> > more decorators added in the future?
> >> >
> >> > From: Reynold Xin <rxin@>
> >> > Sent: Friday, September 1, 2017 5:16:11 AM
> >> > To: Takuya UESHIN
> >> > Cc: spark-dev
> >> > Subject: Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
> >> >
> >> > Ok, thanks.
> >> >
> >> > +1 on the SPIP for scope etc
> >> >
> >> >
> >> > On API details (will deal with in code reviews as well but leaving a
> >> > note here in case I forget)
> >> >
> >> > 1. I would suggest having the API also accept data type specification
> >> > in string form. It is usually simpler to say "long" than "LongType()".
> >> >
> >> > 2. Think about what error message to show when the row counts don't
> >> > match at runtime.
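
A rough sketch of how the string type names from point 1 could drive the return type check asked about elsewhere in this thread. It uses only pandas; the mapping `_DTYPE_CHECKS` and the helper `check_return_type` are hypothetical names made up for illustration and are not Spark APIs.

```python
import pandas as pd
from pandas.api import types as pdt

# Hypothetical mapping from type-name strings to pandas dtype predicates.
# Only a few names are covered; this is not Spark's actual type parsing.
_DTYPE_CHECKS = {
    "long": pdt.is_integer_dtype,
    "double": pdt.is_float_dtype,
    "boolean": pdt.is_bool_dtype,
}


def check_return_type(result, type_name):
    """Raise if the dtype of `result` does not fit the declared type name."""
    check = _DTYPE_CHECKS.get(type_name)
    if check is None:
        raise ValueError("unsupported type name: %r" % type_name)
    if not check(result.dtype):
        raise TypeError(
            "declared return type %r but got pandas dtype %s"
            % (type_name, result.dtype)
        )
    return result


# A Series of integers satisfies a declared "long" return type.
checked = check_return_type(pd.Series([1, 2, 3]), "long")
```

Passing a float Series with a declared type of "long" would raise a TypeError that names both the declared type and the actual dtype.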
> >> >
> >> >
> >> > On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ueshin@> wrote:
> >> > Yes, aggregation is out of scope for now.
> >> > I think we should continue discussing aggregation on JIRA, and we
> >> > will add it later, separately.
> >> >
> >> > Thanks.
> >> >
> >> >
> >> > On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rxin@> wrote:
> >> > Is the idea that aggregates are out of scope for the current effort
> >> > and we will be adding those later?
> >> >
> >> > On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ueshin@> wrote:
> >> > Hi all,
> >> >
> >> > We've been discussing support for vectorized UDFs in Python and have
> >> > almost reached consensus on the APIs, so I'd like to summarize and
> >> > call for a vote.
> >> >
> >> > Note that this vote should focus on APIs for vectorized UDFs, not APIs
> >> > for vectorized UDAFs or Window operations.
> >> >
> >> > https://issues.apache.org/jira/browse/SPARK-21190
> >> >
> >> >
> >> > Proposed API
> >> >
> >> > We introduce a @pandas_udf decorator (or annotation) to define
> >> > vectorized UDFs. A vectorized UDF takes one or more pandas.Series or,
> >> > for a 0-parameter UDF, one integer value giving the length of the
> >> > input. The return value should be a pandas.Series of the specified
> >> > type, and its length should be the same as that of the input.
> >> >
> >> > We can define vectorized UDFs as:
> >> >
> >> >   @pandas_udf(DoubleType())
> >> >   def plus(v1, v2):
> >> >       return v1 + v2
> >> >
> >> > or we can define as:
> >> >
> >> >   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
> >> >
> >> > We can use it similar to row-by-row UDFs:
> >> >
> >> >   df.withColumn('sum', plus(df.v1, df.v2))
> >> >
> >> > As for 0-parameter UDFs, we can define and use as:
> >> >
> >> >   @pandas_udf(LongType())
> >> >   def f0(size):
> >> >       return pd.Series(1).repeat(size)
> >> >
> >> >   df.select(f0())
> >> >
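
The length contract described above can be tried out without a Spark cluster. The following is a minimal sketch in plain pandas, assuming a hypothetical helper `call_vectorized` that stands in for whatever Spark would do per batch; it is not the implementation from the SPIP or the pull request.

```python
import pandas as pd


def call_vectorized(func, *series, size=None):
    """Hypothetical helper (not a Spark API): run `func` on one batch and
    enforce the contract that the result is a pandas.Series whose length
    matches the input."""
    if series:
        expected = len(series[0])
        result = func(*series)
    else:
        # 0-parameter case: the function only receives the batch length.
        if size is None:
            raise ValueError("size is required for a 0-parameter function")
        expected = size
        result = func(size)
    if not isinstance(result, pd.Series):
        raise TypeError(
            "expected a pandas.Series, got %s" % type(result).__name__
        )
    if len(result) != expected:
        raise ValueError(
            "function returned %d rows, expected %d" % (len(result), expected)
        )
    return result


def plus(v1, v2):
    # Element-wise addition over whole columns, as in the proposal.
    return v1 + v2


def f0(size):
    # 0-parameter example from the proposal: a column of ones.
    return pd.Series(1).repeat(size)


v1 = pd.Series([1.0, 2.0, 3.0])
v2 = pd.Series([10.0, 20.0, 30.0])
total = call_vectorized(plus, v1, v2)
ones = call_vectorized(f0, size=3)
```

A function that drops rows, for example `lambda v: v.head(1)`, would fail the length check with a ValueError, which is one possible shape for the runtime error message raised as an open question in this thread.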
> >> >
> >> >
> >> > The vote will be up for the next 72 hours. Please reply with your
> >> > vote:
> >> >
> >> > +1: Yeah, let's go forward and implement the SPIP.
> >> > +0: Don't really care.
> >> > -1: I don't think this is a good idea because of the following
> >> > technical reasons.
> >> >
> >> > Thanks!
> >> >
> >> > --
> >> > Takuya UESHIN
> >> > Tokyo, Japan
> >> >
> >> > http://twitter.com/ueshin
> >> >
> >> >
> >> > --
> >> > Sameer Agarwal
> >> > Software Engineer | Databricks Inc.
> >> > http://cs.berkeley.edu/~sameerag
> >> >
> >> >
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe e-mail: dev-unsubscribe@.apache
> >>
> >>
>
>
>
>
>
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
