spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Narine Kokhlikyan (JIRA)" <>
Subject [jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
Date Sun, 17 Apr 2016 21:23:25 GMT


Narine Kokhlikyan commented on SPARK-12922:

Hi [~sunrui],

I’ve made some progress in putting logical and physical plans together and calling R workers,
however I still have some questions.
1. I’m still not quite sure about the number of partitions. As you wrote in
we need to 
    tune the number of partitions based on “spark.sql.shuffle.partitions”. What do you
exactly mean by tuning? Repartitioning ?
2.   I have another question about grouping by keys:
      groupByKey with one key is fine, however if we have more than one key we probably need
to introduce a case class. With a case
      class it looks okay too, but I’m not sure how convenient it is. Any ideas ?
      case class KeyData(a: Int, b: Int)
      val gd1 = df.groupByKey(r=>KeyData(r.getInt(0), r.getInt(1)))


> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>                 Key: SPARK-12922
>                 URL:
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
> gapply() applies an R function on groups grouped by one or more columns of a DataFrame,
and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must match the R
function's output.
> Note that map-side combination (partial aggregation) is not supported, user could do
map-side combination via dapply().

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message