spark-dev mailing list archives

From: Sean Owen
Subject: Re: Sorting partitions in Java
Date: Tue, 20 May 2014 14:14:49 GMT
It's an Iterator in both Java and Scala. In both cases you need to
copy the stream of values into something List-like to sort it. An
Iterable would not change that (not sure the API can promise many
iterations anyway).

If you just want the equivalent of "toArray", you can use a utility
method in Commons Collections or Guava. Guava's
Lists.newArrayList(Iterator) does nicely; you can then
Collections.sort() the list with a Comparator and return its iterator().
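
A minimal sketch of that copy-sort-return step, using only the JDK so it
stands alone. The class and method names (PartitionSort, sortedIterator)
are made up for illustration and are not part of Spark or Guava; with
Guava on the classpath, the copy loop could be replaced by
Lists.newArrayList(iter).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, not a Spark API.
final class PartitionSort {

    // Copies every value from the iterator into a list, sorts the list
    // with the given comparator, and returns an iterator over the result.
    // The whole partition is held in memory while this runs.
    static <T> Iterator<T> sortedIterator(Iterator<T> iter, Comparator<? super T> cmp) {
        List<T> buffer = new ArrayList<>();
        while (iter.hasNext()) {
            buffer.add(iter.next());
        }
        Collections.sort(buffer, cmp);
        return buffer.iterator();
    }

    public static void main(String[] args) {
        // Stand-in for the values of one partition.
        Iterator<Integer> partition = Arrays.asList(3, 1, 2).iterator();
        Iterator<Integer> sorted =
            sortedIterator(partition, Comparator.<Integer>naturalOrder());
        while (sorted.hasNext()) {
            System.out.println(sorted.next());
        }
    }
}
```

Inside the function you pass to mapPartitions, you would call something
like this on the partition's Iterator. Depending on the Spark version,
that function may be expected to hand back the sorted list itself (as an
Iterable) rather than its iterator(), so check the signature you are
compiling against.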

I dug this up too, remembering a similar question:

On Tue, May 20, 2014 at 2:25 PM, Madhu wrote:
> I'm trying to sort data in each partition of an RDD.
> I was able to do it successfully in Scala like this:
> val sorted = rdd.mapPartitions(iter => {
>   iter.toArray.sortWith((x, y) => < 0).iterator
> },
> preservesPartitioning = true)
> I used the same technique as in OrderedRDDFunctions.scala, so I assume it's
> a reasonable way to do it.
> This works well so far, but I can't seem to do the same thing in Java
> because 'iter' in the Java APIs is an Iterator rather than an Iterable.
> There may be an unattractive workaround, but I didn't pursue it.
> Ideally, it would be nice to have an efficient, robust method in RDD to sort
> each partition.
> Does something like that exist?
> Thanks!
> -----
> --
> Madhu
> --
