spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
Date Sun, 30 Aug 2015 01:14:45 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717079#comment-14717079 ]

Reynold Xin edited comment on SPARK-6817 at 8/30/15 1:14 AM:
-------------------------------------------------------------

Here are some suggestions on the proposed API. If the idea is to keep the API close to R's current primitives, we should avoid introducing too many new keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since collect already exists in Spark, and R users are comfortable with the syntax as part of dplyr, we should reuse the keyword instead of introducing a new function dapplyCollect. Relying on existing syntax will reduce the learning curve for users. Was performance the primary motivation for introducing dapplyCollect instead of collect(dapply(...))?
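
For concreteness, here is a minimal sketch of the suggested equivalence, assuming the proposed signatures dapply(df, func, schema) and dapplyCollect(df, func) on a SparkR DataFrame sdf (these names and arguments are taken from the proposal under discussion, not an existing API):

{code}
# Hypothetical sketch; `sdf` is a SparkR DataFrame, and `dapply` /
# `dapplyCollect` are the proposed primitives, not shipped functions.
incAge <- function(x) { x$age <- x$age + 1; x }

r1 <- dapplyCollect(sdf, incAge)                 # proposed convenience form
r2 <- collect(dapply(sdf, incAge, schema(sdf)))  # composing existing collect with dapply
# r1 and r2 should yield the same local data.frame.
{code}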

Similarly, can we do away with gapply and gapplyCollect, and express them using dapply? In R, the function "split" provides grouping (https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One should be able to implement "split" using GroupBy in Spark. "gapply" can then be expressed in terms of dapply and split, and gapplyCollect becomes collect(dapply(..split..)). Here is a simple example that uses split and lapply in R:

{code}
# Group ages by city, then compute the per-city mean
df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))
print(df)
s <- split(df$age, df$city)  # list of age vectors keyed by city
lapply(s, mean)              # mean age per city
{code}
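
Building on that, here is a purely local analogue of the collect(dapply(..split..)) shape: split the frame by a key, apply a per-group function, and bind the results back together (the grouped rewriting itself is only a sketch of the proposal, not an existing SparkR API):

{code}
# Local analogue of a grouped UDF expressed via split:
groups <- split(df, df$city)  # "GroupBy" on the local data.frame
result <- do.call(rbind, lapply(groups, function(g) {
  # per-group function: one output row per city
  data.frame(city = g$city[1], mean_age = mean(g$age))
}))
print(result)
{code}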


> DataFrame UDFs in R
> -------------------
>
>                 Key: SPARK-6817
>                 URL: https://issues.apache.org/jira/browse/SPARK-6817
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR, SQL
>            Reporter: Shivaram Venkataraman
>
> This depends on some internal interfaces of Spark SQL and should be done after merging into Spark.


