Please take a look at:


On Tue, Oct 20, 2015 at 3:18 PM, ChengBo <> wrote:

Thanks, but I still don’t get it.

I have used groupBy to group the data by userID, and for each ID I need to compute the statistics.





From: Ted Yu []
Sent: Tuesday, October 20, 2015 3:12 PM
To: ChengBo
Cc: user
Subject: Re: Get statistic result from RDD


Your mapValues can emit a tuple: if p(0) is between 0 and 5, the first component of the tuple would be 1 and the second 0.

If p(0) is 6 or 7, the first component would be 0 and the second 1.

You can then use reduceByKey to sum up the corresponding components.
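A sketch of this map-then-sum idea using plain Scala collections (the same lambdas would run on the RDD via map and reduceByKey; the sample data and names here are made up for illustration):

```scala
// Hypothetical (p0, userID) pairs; p0 is the 0-7 value, userID is p(3).
val records = Seq((2L, "u1"), (6L, "u1"), (3L, "u2"), (7L, "u2"), (7L, "u2"))

// Emit (userID, (1, 0)) when p0 is in 0-5, (userID, (0, 1)) when it is 6 or 7.
val keyed = records.map { case (p0, user) =>
  if (p0 <= 5) (user, (1, 0)) else (user, (0, 1))
}

// Component-wise sum per key -- on a pair RDD this is exactly
// keyed.reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)).
val counts = keyed.groupBy(_._1).map { case (user, pairs) =>
  user -> pairs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
}
```

Each entry of counts is then (userID, (countOf0to5, countOf6to7)).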


On Tue, Oct 20, 2015 at 1:33 PM, Shepherd <> wrote:

Hi all,

I am really new to Spark and Scala.
I cannot get the statistics result from an RDD. Could someone help me?
My current code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val webFile = sc.textFile("/home/Dataset/web_info.csv")
val webItem = => line.split(","))
val webEachRDD = => (p(0).toLong, p(1).toLong, p(2).toLong,
  p(3), p(5))) // Long, Long, Long, String, String; p(3) here is the user ID,
               // and each user ID will have multiple rows.

val webGroup = webEachRDD.groupBy(_._4)

val res = webGroup.mapValues(v => {
  (wkd.count, wknd.count)  // wkd / wknd: this is the part I don't know how to write
})

How can I write webGroup.mapValues so that I can get each user ID's
statistics?
For example: p(0) is an int between 0 and 7.
For each userID, I want to count how many rows have p(0) between 0 and 5,
and how many have p(0) equal to 6 or 7.
In the final result, each row represents one userID's statistics.
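For the groupBy shape above, the mapValues body could be filled in along these lines; a sketch with plain Scala collections and hypothetical sample rows (on the RDD, groupBy likewise yields (userID, Iterable[row]) pairs, so the same lambdas apply via mapValues):

```scala
// Hypothetical sample rows shaped like webEachRDD: (p0, p1, p2, userID, p5).
val rows = Seq(
  (2L, 0L, 0L, "u1", "x"),
  (6L, 0L, 0L, "u1", "x"),
  (7L, 0L, 0L, "u2", "x")
)

// Group by userID (the 4th field), then count p0 in 0-5 vs p0 in 6-7
// within each user's rows.
val res = rows.groupBy(_._4).map { case (user, vs) =>
  user -> (vs.count(_._1 <= 5), vs.count(v => v._1 == 6 || v._1 == 7))
}
```

Note that the reduceByKey formulation avoids materializing each user's full row list, which matters for users with many rows.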

Thanks a lot. I really appreciate it.

Sent from the Apache Spark User List mailing list archive at
