spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: Aggregating over sorted data
Date Wed, 21 Dec 2016 02:58:21 GMT
Hi,

Can you try the combination of `repartition` + `sortWithinPartitions` on the
dataset?

E.g.,

    val df = Seq((2, "b c a"), (1, "c a b"), (3, "a c b")).toDF("number",
"letters")
    val df2 =
      df.explode('letters) {
        case Row(letters: String) => letters.split(" ").map(Tuple1(_)).toSeq
      }

    df2
      .select('number, '_1 as 'letter)
      .repartition('number)
      .sortWithinPartitions('number, 'letter)
      .groupBy('number)
      .agg(collect_list('letter))
      .show()

    +------+--------------------+
    |number|collect_list(letter)|
    +------+--------------------+
    |     3|           [a, b, c]|
    |     1|           [a, b, c]|
    |     2|           [a, b, c]|
    +------+--------------------+

I think it should let you do aggregate on sorted data per key.




-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Aggregating-over-sorted-data-tp19999p20310.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Mime
View raw message