spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Resolved] (SPARK-22568) Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey
Date Tue, 21 Nov 2017 03:00:00 GMT


Sean Owen resolved SPARK-22568.
    Resolution: Not A Problem

I think this is more of a usage question, so belongs on the mailing list.
You can indeed filter by each distinct key individually; this doesn't mean calling collect().
You can already group by the key. You can hash by the key and sort within partitions in one
operation, which lets you encounter all values for a key at a time while traversing partitions.
I think there are plenty of tools to do the kind of thing you mention already.
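
As a rough illustration of the hash-and-sort suggestion above: in Spark this most likely refers to `repartitionAndSortWithinPartitions` on a pair RDD, though the message does not name the call, so treat that as an assumption. The sketch below is a plain-Python model with no Spark dependency. The helper names `partition_and_sort` and `runs_by_key` and the sample data are hypothetical. It only shows why sorting inside hash partitions lets a single pass over each partition see all values for a key consecutively.

```python
from itertools import groupby
from operator import itemgetter
from zlib import crc32


def partition_and_sort(pairs, num_partitions):
    """Model of hash-partitioning (key, value) pairs, then sorting each partition by key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        # A stable hash sends every record with the same key to the same partition.
        index = crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append((key, value))
    for part in partitions:
        # list.sort is stable, so values keep their arrival order within a key.
        part.sort(key=itemgetter(0))
    return partitions


def runs_by_key(partition):
    """Yield (key, values) for each consecutive run of equal keys in a sorted partition."""
    for key, group in groupby(partition, key=itemgetter(0)):
        yield key, [value for _, value in group]


# Hypothetical sample data: (category, element) pairs.
pairs = [("fruit", "apple"), ("veg", "kale"), ("fruit", "pear"),
         ("grain", "rye"), ("veg", "leek")]

partitions = partition_and_sort(pairs, num_partitions=2)
grouped = {key: values
           for part in partitions
           for key, values in runs_by_key(part)}
```

Because each key lives in exactly one partition and its records are adjacent after the sort, the traversal never needs to hold more than one key's run at a time, which is the property that makes this usable in place of a full `groupByKey`.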

> Split pair RDDs by keys - an efficient (maybe?) substitute to groupByKey
> ------------------------------------------------------------------------
>                 Key: SPARK-22568
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Éderson Cássio
>              Labels: features, performance, usability
> Sorry for any mistakes in filling out this big form... it's my first issue here :)
> Recently, I needed to separate an RDD by some categorization. I was able to accomplish
> that in a few ways.
> First, the obvious: mapping each element to a pair, with the key being the category of
> the element, then using the good ol' {{groupByKey}}.
> Following advice to avoid {{groupByKey}}, I failed to find another way that was more
> efficient. I ended up (a) obtaining the distinct list of element categories, (b) {{collect}}ing
> them and (c) making a call to {{filter}} for each category. Of course, before all that I
> {{cache}}d my initial RDD.
> So I started to speculate: maybe it would be possible to make a number of RDDs from
> an initial pair RDD _without the need to shuffle the data_. It could be done by a kind of
> _local repartition_: first, each partition is split into several by key; then the master
> groups the partitions with the same key into a new RDD. The operation returns a List or array
> containing the new RDDs.
> It's just a conjecture; I don't know if it would be feasible in the current Spark Core
> architecture. But it would be great if it could be done.
