spark-user mailing list archives

From Jaka Jančar <j...@kubje.org>
Subject Re: partitioning via groupByKey
Date Wed, 19 Mar 2014 16:42:03 GMT
The former: a single new RDD is returned.

Check the PairRDDFunctions docs (http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.PairRDDFunctions):

def groupByKey(): RDD[(K, Seq[V])]
Group the values for each key in the RDD into a single sequence.
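
A minimal sketch of what that looks like in practice, assuming a local Spark installation on the classpath. The app name, the sample data, and the partition count of 4 are made up for illustration. Depending on the Spark version, the grouped values come back as Seq[V] (as in the signature above) or as Iterable[V]; the code below works with either.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits; only required on older Spark releases

// Hypothetical local setup, for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("groupByKeyDemo")
val sc = new SparkContext(conf)

// One input RDD of key/value pairs, spread over 4 partitions.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)), 4)

// groupByKey shuffles records so that all values for a key end up in the
// same partition, but what comes back is still one RDD, not several.
val grouped = pairs.groupByKey()

// The single result RDD simply has some number of partitions.
println("partitions in grouped RDD: " + grouped.partitions.length)

// Collect to the driver; values are sorted here only to make the output stable.
val result = grouped.collect().map { case (k, vs) => (k, vs.toList.sorted) }.toMap
result.foreach { case (k, vs) => println(k + " -> " + vs.mkString(",")) }

sc.stop()
```

The shuffle changes which partition (and so which node) holds each record; it does not split the data set into separate RDDs.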


On Wednesday, March 19, 2014 at 9:32 AM, Adrian Mocanu wrote:

> When you partition via groupByKey, tuples (parts of the RDD) are moved from one node
> to another based on key (hash partitioning).
> Do the tuples remain part of one RDD as before, just moved to different nodes, or does
> this shuffling create, say, several RDDs, each holding part of the original RDD?
>  
> Thanks
> -Adrian