spark-issues mailing list archives

From "koert kuipers (JIRA)" <>
Subject [jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
Date Sun, 07 Dec 2014 22:34:12 GMT


koert kuipers commented on SPARK-3655:

I have a new pull request that implements just groupByKeyAndSortValues in Scala and Java. I will
need some help with Python.

The pull request is here:

I changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])],
since I don't see a reasonable way to return Iterables without resorting to keeping the data
in memory.
The assumption is that once you move on to the next key within a partition, the previous
key's values (the TraversableOnce[V]) will no longer be used.
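
To illustrate the trade-off, here is a minimal sketch in plain Scala (no Spark dependency) of grouping an iterator that is already sorted by key into one-pass value iterators without buffering. The class name GroupSorted and its structure are hypothetical and are not taken from the pull request; the sketch also assumes the input really is sorted by key, and that sorting the values within each key has happened upstream.

```scala
// Hypothetical helper, not Spark API and not the code from the pull request.
// Assumes `sorted` is already ordered by key, so equal keys are adjacent.
final class GroupSorted[K, V](sorted: Iterator[(K, V)])
    extends Iterator[(K, Iterator[V])] {

  private val in = sorted.buffered
  private var currentKey: Option[K] = None
  private var generation = 0 // bumped on every new key to invalidate old value iterators

  def hasNext: Boolean = {
    // Skip whatever the caller left unread of the current group.
    currentKey.foreach(k => while (in.hasNext && in.head._1 == k) in.next())
    in.hasNext
  }

  def next(): (K, Iterator[V]) = {
    if (!hasNext) throw new NoSuchElementException("no more keys")
    val key = in.head._1
    currentKey = Some(key)
    generation += 1
    val mine = generation
    val values = new Iterator[V] {
      // Usable only while this group is the current one.
      def hasNext: Boolean = mine == generation && in.hasNext && in.head._1 == key
      def next(): V = {
        if (!hasNext) throw new NoSuchElementException("values already consumed")
        in.next()._2
      }
    }
    (key, values)
  }
}

object GroupSortedDemo {
  def main(args: Array[String]): Unit = {
    val sorted = Iterator(("a", 1), ("a", 2), ("b", 3))
    // Each group is consumed before moving to the next key, so this works.
    val sums = new GroupSorted(sorted).map { case (k, vs) => (k, vs.sum) }.toList
    println(sums) // List((a,3), (b,3))
  }
}
```

Note that calling toList on the grouped iterator first and reading the values afterwards yields empty value iterators, because advancing to the next key discards the rest of the previous group. That is the kind of mistake this return type makes easy.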

Personally, I find this API too generic and too easy to abuse or make mistakes with, so I
prefer a more constrained API like foldLeft.
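
As a rough sketch of what a more constrained foldLeft-style API could look like, again in plain Scala over a key-sorted iterator: the caller supplies a zero value and a fold function and never holds the per-key values. The name foldLeftByKey and its signature are assumptions for illustration, not an existing Spark method and not what the pull request implements.

```scala
object FoldSorted {
  // Hypothetical helper, not Spark API. Assumes `sorted` is already ordered
  // by key (and by value within each key, if value order matters to `op`).
  // `zero` is shared across keys, so it should be an immutable value.
  def foldLeftByKey[K, V, B](sorted: Iterator[(K, V)], zero: B)(
      op: (B, V) => B): Iterator[(K, B)] = {
    val in = sorted.buffered
    new Iterator[(K, B)] {
      def hasNext: Boolean = in.hasNext
      def next(): (K, B) = {
        val key = in.head._1 // throws NoSuchElementException when exhausted
        var acc = zero
        // Consume the whole run of equal keys before returning.
        while (in.hasNext && in.head._1 == key) acc = op(acc, in.next()._2)
        (key, acc)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val sorted = Iterator(("a", 1), ("a", 2), ("b", 3))
    val result = foldLeftByKey(sorted, List.empty[Int])((acc, v) => v :: acc).toList
    println(result) // List((a,List(2, 1)), (b,List(3)))
  }
}
```

Because each key's values are folded completely inside next(), there is no one-pass iterator left for the caller to hold on to after moving to the next key.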

> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>                 Key: SPARK-3655
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>            Priority: Minor
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are
> some use cases where getting a sorted iterator of values per key is helpful.

This message was sent by Atlassian JIRA
