spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Sorting partitions in Java
Date Tue, 20 May 2014 14:14:49 GMT
It's an Iterator in both Java and Scala. In both cases you need to
copy the stream of values into something List-like to sort it. An
Iterable would not change that (and I'm not sure the API can promise
more than one pass over the values anyway).

If you just want the equivalent of "toArray", you can use a utility
method in Commons Collections or Guava. Guava's
Lists.newArrayList(Iterator) does nicely: you can then sort the list
with Collections.sort() and a Comparator, and return its iterator().
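
For illustration, here is a rough sketch of that copy-then-sort step
using only the JDK. The class and method names (PartitionSort,
sortedIterator) are made up for this example and are not part of Spark
or Guava; the copy loop stands in for Lists.newArrayList(Iterator).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper for illustration only; not a Spark or Guava API.
final class PartitionSort {
    private PartitionSort() {}

    // Copies every value from the iterator into a list, sorts the list
    // with the given comparator, and returns an iterator over the result.
    static <T> Iterator<T> sortedIterator(Iterator<T> values,
                                          Comparator<? super T> order) {
        // Plain-JDK stand-in for Guava's Lists.newArrayList(Iterator).
        List<T> buffer = new ArrayList<T>();
        while (values.hasNext()) {
            buffer.add(values.next());
        }
        Collections.sort(buffer, order);
        return buffer.iterator();
    }
}
```

You would call something like this from the function you pass to
mapPartitions. Note that it holds the whole partition in memory, and
depending on the Spark version the function may need to return the
sorted list itself rather than its iterator.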

I dug this up too, remembering a similar question:
http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3C529F819F.3060901@vu.nl%3E

On Tue, May 20, 2014 at 2:25 PM, Madhu <madhu@madhu.com> wrote:
> I'm trying to sort data in each partition of an RDD.
> I was able to do it successfully in Scala like this:
>
> val sorted = rdd.mapPartitions(iter => {
>   iter.toArray.sortWith((x, y) => x._2.compare(y._2) < 0).iterator
> },
> preservesPartitioning = true)
>
> I used the same technique as in OrderedRDDFunctions.scala, so I assume it's
> a reasonable way to do it.
>
> This works well so far, but I can't seem to do the same thing in Java
> because 'iter' in the Java APIs is an Iterator rather than an Iterable.
> There may be an unattractive workaround, but I didn't pursue it.
>
> Ideally, it would be nice to have an efficient, robust method in RDD to sort
> each partition.
> Does something like that exist?
>
> Thanks!
>
>
>
> -----
> --
> Madhu
> https://www.linkedin.com/in/msiddalingaiah
