spark-issues mailing list archives

From "Sandy Ryza (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-2978) Provide an MR-style shuffle transformation
Date Tue, 12 Aug 2014 01:02:39 GMT

     [ https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza updated SPARK-2978:
------------------------------

    Description: 
For Hive on Spark joins in particular, and for running legacy MR code in general, I think
it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle,
i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order
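
To make the two properties above concrete, here is a minimal pure-Python sketch of the intended semantics. It does not use Spark or Hadoop. The function name mr_style_shuffle, the hash-modulo partitioning, and the sample records are all assumptions made for illustration and are not part of this proposal. Values are materialized as lists rather than iterators to keep the result easy to inspect.

```python
from itertools import groupby
from operator import itemgetter


def mr_style_shuffle(records, num_partitions):
    """Simulate an MR-style shuffle over a list of (key, value) pairs.

    Hypothetical helper for illustration only. Returns one list per
    partition; each list holds (key, values) groups in ascending key order.
    """
    # Route each record to a partition by key hash (an assumed partitioner).
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))

    shuffled = []
    for part in partitions:
        # Sort by key only; the sort is stable, so values keep arrival order.
        part.sort(key=itemgetter(0))
        # After sorting, equal keys are adjacent, so groupby yields one
        # group per key, much like the input a reducer sees.
        groups = [(key, [v for _, v in group])
                  for key, group in groupby(part, key=itemgetter(0))]
        shuffled.append(groups)
    return shuffled


# Small integer keys are used because their hashes are deterministic.
records = [(3, "c"), (1, "a"), (2, "b"), (1, "d"), (4, "e"), (3, "f")]
result = mr_style_shuffle(records, num_partitions=2)
for index, groups in enumerate(result):
    print(index, groups)
```

Note that keys are ordered only within a partition, not across partitions, which matches the second bullet above.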

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
* Allow groupByKey to take an ordering param for keys within a partition

  was:
For Hive on Spark joins in particular, and for running legacy MR code in general, I think
it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle,
i.e. one that
* groups by key: provides (Key, Iterator[Value])
* within each partition, provides keys in sorted order

A couple ways that could make sense to expose this:
* Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle"
* Allow groupByKey to take an ordering param for keys within a partition


> Provide an MR-style shuffle transformation
> ------------------------------------------
>
>                 Key: SPARK-2978
>                 URL: https://issues.apache.org/jira/browse/SPARK-2978
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Sandy Ryza
>
> For Hive on Spark joins in particular, and for running legacy MR code in general, I think
> it would be useful to provide a transformation with the semantics of the Hadoop MR shuffle,
> i.e. one that
> * groups by key: provides (Key, Iterator[Value])
> * within each partition, provides keys in sorted order
>
> A couple ways that could make sense to expose this:
> * Add a new operator.  "groupAndSortByKey", "groupByKeyAndSortWithinPartition", "hadoopStyleShuffle", maybe?
> * Allow groupByKey to take an ordering param for keys within a partition



--
This message was sent by Atlassian JIRA
(v6.2#6252)


