spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sameer Agarwal <sam...@databricks.com>
Subject Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
Date Tue, 12 Sep 2017 00:47:54 GMT
+1 (non-binding)

On Thu, Sep 7, 2017 at 9:10 PM, Bryan Cutler <cutlerb@gmail.com> wrote:

> +1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
> fine to work out the minor details of the API during review.
>
> Bryan
>
> On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN <ueshin@happy-camper.st>
> wrote:
>
>> Hi all,
>>
>> Thank you for voting and suggestions.
>>
>> As Wenchen mentioned and also we're discussing at JIRA, we need to
>> discuss the size hint for the 0-parameter UDF.
>> But I believe we got a consensus about the basic APIs except for the size
>> hint, I'd like to submit a pr based on the current proposal and continue
>> discussing in its review.
>>
>>     https://github.com/apache/spark/pull/19147
>>
>> I'd keep this vote open to wait for more opinions.
>>
>> Thanks.
>>
>>
>> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan <cloud0fan@gmail.com> wrote:
>>
>>> +1 on the design and proposed API.
>>>
>>> One detail I'd like to discuss is the 0-parameter UDF, how we can
>>> specify the size hint. This can be done in the PR review though.
>>>
>>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung <felixcheung_m@hotmail.com>
>>> wrote:
>>>
>>>> +1 on this and like the suggestion of type in string form.
>>>>
>>>> Would it be correct to assume there will be data type check, for
>>>> example the returned pandas data frame column data types match what are
>>>> specified. We have seen quite a bit of issues/confusions with that in R.
>>>>
>>>> Would it make sense to have a more generic decorator name so that it
>>>> could also be useable for other efficient vectorized format in the future?
>>>> Or do we anticipate the decorator to be format specific and will have more
>>>> in the future?
>>>>
>>>> ------------------------------
>>>> *From:* Reynold Xin <rxin@databricks.com>
>>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>>> *To:* Takuya UESHIN
>>>> *Cc:* spark-dev
>>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>>
>>>> Ok, thanks.
>>>>
>>>> +1 on the SPIP for scope etc
>>>>
>>>>
>>>> On API details (will deal with in code reviews as well but leaving a
>>>> note here in case I forget)
>>>>
>>>> 1. I would suggest having the API also accept data type specification
>>>> in string form. It is usually simpler to say "long" then "LongType()".
>>>>
>>>> 2. Think about what error message to show when the rows numbers don't
>>>> match at runtime.
>>>>
>>>>
>>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN <ueshin@happy-camper.st>
>>>> wrote:
>>>>
>>>>> Yes, the aggregation is out of scope for now.
>>>>> I think we should continue discussing the aggregation at JIRA and we
>>>>> will be adding those later separately.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin <rxin@databricks.com>
>>>>> wrote:
>>>>>
>>>>>> Is the idea aggregate is out of scope for the current effort and
we
>>>>>> will be adding those later?
>>>>>>
>>>>>> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN <ueshin@happy-camper.st>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We've been discussing to support vectorized UDFs in Python and
we
>>>>>>> almost got a consensus about the APIs, so I'd like to summarize
and
>>>>>>> call for a vote.
>>>>>>>
>>>>>>> Note that this vote should focus on APIs for vectorized UDFs,
not
>>>>>>> APIs for vectorized UDAFs or Window operations.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-21190
>>>>>>>
>>>>>>>
>>>>>>> *Proposed API*
>>>>>>>
>>>>>>> We introduce a @pandas_udf decorator (or annotation) to define
>>>>>>> vectorized UDFs which takes one or more pandas.Series or one
>>>>>>> integer value meaning the length of the input value for 0-parameter
UDFs.
>>>>>>> The return value should be pandas.Series of the specified type
and
>>>>>>> the length of the returned value should be the same as input
value.
>>>>>>>
>>>>>>> We can define vectorized UDFs as:
>>>>>>>
>>>>>>>   @pandas_udf(DoubleType())
>>>>>>>   def plus(v1, v2):
>>>>>>>       return v1 + v2
>>>>>>>
>>>>>>> or we can define as:
>>>>>>>
>>>>>>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>>>>>>
>>>>>>> We can use it similar to row-by-row UDFs:
>>>>>>>
>>>>>>>   df.withColumn('sum', plus(df.v1, df.v2))
>>>>>>>
>>>>>>> As for 0-parameter UDFs, we can define and use as:
>>>>>>>
>>>>>>>   @pandas_udf(LongType())
>>>>>>>   def f0(size):
>>>>>>>       return pd.Series(1).repeat(size)
>>>>>>>
>>>>>>>   df.select(f0())
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The vote will be up for the next 72 hours. Please reply with
your
>>>>>>> vote:
>>>>>>>
>>>>>>> +1: Yeah, let's go forward and implement the SPIP.
>>>>>>> +0: Don't really care.
>>>>>>> -1: I don't think this is a good idea because of the following
technical
>>>>>>> reasons.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> --
>>>>>>> Takuya UESHIN
>>>>>>> Tokyo, Japan
>>>>>>>
>>>>>>> http://twitter.com/ueshin
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Takuya UESHIN
>>>>> Tokyo, Japan
>>>>>
>>>>> http://twitter.com/ueshin
>>>>>
>>>>
>>>
>>
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>
>


-- 
Sameer Agarwal
Software Engineer | Databricks Inc.
http://cs.berkeley.edu/~sameerag

Mime
View raw message