spark-user mailing list archives

From: Silvio Fiorito <silvio.fior...@granturing.com>
Subject: Re: workaround for groupByKey
Date: Mon, 22 Jun 2015 21:43:32 GMT
You can use aggregateByKey as one option:

import scala.collection.mutable.ListBuffer

val input: RDD[(Int, String)] = ...

val test = input.aggregateByKey(ListBuffer.empty[String])(
  (buf, v) => buf += v,    // seqOp: append each value to the partition-local buffer
  (b1, b2) => b1 ++= b2)   // combOp: merge the buffers from different partitions
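
For reference, here's a minimal, self-contained sketch of the same approach; the sample
data and the SparkContext name sc are assumptions for illustration:

import scala.collection.mutable.ListBuffer
import org.apache.spark.rdd.RDD

// Hypothetical sample data: (user_id, url) pairs
val input: RDD[(Int, String)] = sc.parallelize(Seq(
  (1, "a.com"), (1, "b.com"), (2, "c.com")))

val urlsByUser: RDD[(Int, ListBuffer[String])] =
  input.aggregateByKey(ListBuffer.empty[String])(
    (buf, url) => buf += url,   // combine values within each partition, map-side
    (b1, b2) => b1 ++= b2)      // merge the partial buffers after the shuffle

Compared to groupByKey, this builds per-partition buffers before the shuffle, so each key
is shipped at most once per partition rather than once per record (the url data itself
still has to move, of course).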

From: Jianguo Li
Date: Monday, June 22, 2015 at 5:12 PM
To: "user@spark.apache.org<mailto:user@spark.apache.org>"
Subject: workaround for groupByKey

Hi,

I am processing an RDD of key-value pairs. The key is a user_id, and the value is a
website URL the user has visited.

Since I need to know all the URLs each user has visited, I am tempted to call groupByKey
on this RDD. However, since there could be millions of users and URLs, the shuffle caused
by groupByKey proves to be a major bottleneck. Is there any workaround? I want to end up
with an RDD of key-value pairs, where the key is a user_id and the value is a list of all
the URLs visited by that user.

Thanks,

Jianguo