spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Get statistic result from RDD
Date Tue, 20 Oct 2015 22:46:16 GMT
Please take a look at:
examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala

Cheers

On Tue, Oct 20, 2015 at 3:18 PM, ChengBo <Cheng.Bo@huawei.com> wrote:

> Thanks, but I still don’t get it.
>
> I have used groupBy to group the data by userID, and for each ID I need to
> get the statistics.
>
>
>
> Best
>
> Frank
>
>
>
> *From:* Ted Yu [mailto:yuzhihong@gmail.com]
> *Sent:* Tuesday, October 20, 2015 3:12 PM
> *To:* ChengBo
> *Cc:* user
> *Subject:* Re: Get statistic result from RDD
>
>
>
> Your mapValues can emit a tuple. If p(0) is between 0 and 5, first
> component of tuple would be 1, second being 0.
>
> If p(0) is 6 or 7, first component of tuple would be 0, second being 1.
>
>
> You can use reduceByKey to sum up corresponding component.
>
>
>
> On Tue, Oct 20, 2015 at 1:33 PM, Shepherd <Cheng.Bo@huawei.com> wrote:
>
> Hi all,
>
> I am really new to Spark and Scala.
> I cannot get statistics from an RDD. Could someone help me with
> this?
> My current code is as follows:
>
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
>
> val webFile = sc.textFile("/home/Dataset/web_info.csv")
> webFile.cache()
> val webItem = webFile.map(line => line.split(","))
> // Long, Long, Long, String, String. p(3) is the user ID, and each user ID
> // will have multiple rows.
> val webEachRDD = webItem.map(p => (p(0).toLong, p(1).toLong, p(2).toLong,
>   p(3), p(5)))
>
> val webGroup = webEachRDD.groupBy(_._4)
>
> val res = webGroup.mapValues(v => {
>         ....
>         (wkd.count, wknd.count)
> })
>
> How can I write webGroup.mapValues so that I get each user ID's
> statistics?
> For example, p(0) is an int between 0 and 7.
> For each userID, I want to count how many values of p(0) are 0 to 5 and how
> many are 6 to 7.
> In the final result, each row holds the statistics for one userID.
>
> Thanks a lot. I really appreciate it.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Get-statistic-result-from-RDD-tp25147.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
>
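The suggestion in the thread (emit a tuple per record, then sum the components with reduceByKey) could look roughly like the sketch below. It is only an illustration under stated assumptions: the object name `UserDayStats`, the helpers `dayFlags` and `addCounts`, the local master setting, and the sample rows are all hypothetical and not from the original code. It also skips the groupBy step and keys each row by p(3) directly.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UserDayStats {
  // Turn the p(0) value into (count for 0..5, count for 6..7).
  // Values outside 0..7 contribute to neither count (an assumption).
  def dayFlags(day: Long): (Int, Int) =
    if (day >= 0 && day <= 5) (1, 0)
    else if (day == 6 || day == 7) (0, 1)
    else (0, 0)

  // Component-wise sum of two count tuples, used by reduceByKey.
  def addCounts(a: (Int, Int), b: (Int, Int)): (Int, Int) =
    (a._1 + b._1, a._2 + b._2)

  def main(args: Array[String]): Unit = {
    // Local master is assumed here only so the sketch runs standalone.
    val conf = new SparkConf().setAppName("UserDayStats").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical rows standing in for web_info.csv; p(3) is the user ID.
      val lines = sc.parallelize(Seq(
        "0,10,20,userA,x,y",
        "6,11,21,userA,x,y",
        "3,12,22,userB,x,y",
        "7,13,23,userB,x,y",
        "5,14,24,userB,x,y"
      ))

      val counts = lines
        .map(_.split(","))
        .map(p => (p(3), dayFlags(p(0).toLong))) // key each row by user ID
        .reduceByKey(addCounts)                  // sum both components per user

      // One row per user ID: (count of 0..5, count of 6..7).
      counts.collect().foreach { case (user, (low, high)) =>
        println(s"$user low=$low high=$high")
      }
    } finally {
      sc.stop()
    }
  }
}
```

If the grouped RDD from the original code is needed anyway, the same counts could probably be computed inside mapValues with something like `webGroup.mapValues(v => (v.count(_._1 <= 5), v.count(_._1 >= 6)))`, although reduceByKey avoids collecting all rows of a user in one place.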
