Thanks, but I still don’t get it.

I have used groupBy to group the data by user ID, and for each ID I need to compute the statistics.





Your mapValues can emit a tuple. If p(0) is between 0 and 5, the first component of the tuple would be 1 and the second 0.

If p(0) is 6 or 7, the first component would be 0 and the second 1.

You can then use reduceByKey to sum the corresponding components.
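A minimal sketch of that suggestion, assuming the data has already been reduced to (userID, p(0)) pairs. The object name WeekdayCounts, the helper toPair, and the sample rows are hypothetical and only there for illustration; treating values outside 0 to 7 as contributing nothing is also an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WeekdayCounts {
  // Turn one p(0) value into a (count of 0-5, count of 6-7) pair.
  // Assumption: values outside 0-7 contribute nothing to either count.
  def toPair(day: Long): (Int, Int) =
    if (day >= 0 && day <= 5) (1, 0)
    else if (day == 6 || day == 7) (0, 1)
    else (0, 0)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WeekdayCounts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample rows: (userID, p(0))
      val rows = sc.parallelize(Seq(("u1", 0L), ("u1", 6L), ("u1", 3L), ("u2", 7L)))

      val counts = rows
        .mapValues(toPair) // each row becomes (1, 0) or (0, 1)
        .reduceByKey { case ((a1, b1), (a2, b2)) => (a1 + a2, b1 + b2) } // sum per user

      // One line per user ID: how many values fell in 0-5 and how many in 6-7
      counts.collect().foreach { case (user, (low, high)) =>
        println(s"$user: $low values in 0-5, $high values in 6-7")
      }
    } finally {
      sc.stop()
    }
  }
}
```

With the code from the original question, the pairs could probably be built with something like webEachRDD.map(t => (t._4, t._1)) in place of the groupBy, since reduceByKey does the per-user grouping itself.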


Hi all,

I am a real newbie in Spark and Scala.
I cannot get the statistics I need from an RDD. Could someone help me with this?
My current code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val webFile = sc.textFile("/home/Dataset/web_info.csv")
val webItem = webFile.map(line => line.split(","))
// Long, Long, Long, String, String; p(3) here is the user ID, and each
// user ID will have multiple rows.
val webEachRDD = webItem.map(p => (p(0).toLong, p(1).toLong, p(2).toLong,
p(3), p(5)))

val webGroup = webEachRDD.groupBy(_._4)

val res = webGroup.mapValues(v => {
        (wkd.count, wknd.count)

How can I write webGroup.mapValues so that I get each user ID's
statistics?
For example: p(0) is an integer between 0 and 7.
For each user ID, I want to know how many values of p(0) fall in 0 to 5 and
how many fall in 6 to 7.
In the final result, each row should hold the statistics for one user ID.

Thanks a lot. I really appreciate it.

