spark-user mailing list archives

From ayan guha <>
Subject Re: partitioner aware subtract
Date Mon, 09 May 2016 10:27:24 GMT
How about outer join?
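That suggestion can be sketched in plain Python (rather than Spark, to keep it self-contained and runnable): a left outer join of rdd1 with rdd2 on the key, followed by a filter that keeps only the rows with no right-side match. The function name and data below are illustrative, not Spark API.

```python
# Plain-Python sketch of "subtract via outer join": keep a (key, value)
# pair from the left side only when its key has no match on the right.
# In Spark terms this corresponds to a leftOuterJoin followed by a
# filter on the missing right side.

def subtract_via_outer_join(left, right):
    """left, right: lists of (key, value) pairs; returns the pairs of
    `left` whose key does not appear in `right`."""
    right_keys = {k for k, _ in right}            # index the join side
    # "left outer join": pair every left row with whether it matched
    joined = [((k, v), k in right_keys) for k, v in left]
    # keep only the unmatched rows
    return [pair for pair, matched in joined if not matched]

rdd1 = [("a", 1), ("b", 2), ("c", 3)]
rdd2 = [("b", 9)]
print(subtract_via_outer_join(rdd1, rdd2))  # [('a', 1), ('c', 3)]
```

Note that, like subtractByKey(), this matches on the key alone; a join-based approach preserves the existing hash partitioning of the left side, since both inputs are already partitioned on the key.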
On 9 May 2016 13:18, "Raghava Mutharaju" <> wrote:

> Hello All,
> We have two PairRDDs (rdd1, rdd2) that are hash-partitioned on the key
> (the number of partitions is the same for both RDDs). We would like to
> subtract rdd2 from rdd1.
> The subtract code at
> seems to group the elements of both the RDDs using (x, null) where x is the
> element of the RDD and partition them. Then it makes use of
> subtractByKey(). This way, the RDDs have to be repartitioned on x (which,
> in our case, is the key and value combined). Both of our RDDs are
> already hash partitioned on the key of x. Can we take advantage of this by
> having a PairRDD/HashPartitioner-aware subtract? Is there a way to use
> mapPartitions() for this?
> We tried broadcasting rdd2 and using mapPartitions(): we computed a local
> set difference between each partition of rdd1 and the broadcast rdd2
> (inside mapPartitions on rdd1). But this turns out to be memory-consuming
> and inefficient. We did call destroy() on the broadcast value, but it
> does not help.
> The current subtract method is slow for us. rdd1 and rdd2 are around 700MB
> each and the subtract takes around 14 seconds.
> Any ideas on this issue are highly appreciated.
> Regards,
> Raghava.
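For reference, the partitioner-aware subtract the question asks about can be modeled in plain Python: when both sides are hash-partitioned on the key with the same partition count, an element of rdd1 can only have a duplicate in the same-numbered partition of rdd2, so each partition pair can be diffed locally with no shuffle (in Spark this is roughly what zipPartitions over two co-partitioned RDDs would do). All names below are illustrative.

```python
# Plain-Python model of a partitioner-aware subtract over two datasets
# that are hash-partitioned on the key with the same partition count.

NUM_PARTITIONS = 4

def hash_partition(pairs, n=NUM_PARTITIONS):
    """Split (key, value) pairs into n partitions by hash of the key."""
    parts = [[] for _ in range(n)]
    for k, v in pairs:
        parts[hash(k) % n].append((k, v))
    return parts

def copartitioned_subtract(parts1, parts2):
    """Per-partition set difference on full (key, value) pairs,
    analogous to zipPartitions over two co-partitioned RDDs."""
    out = []
    for p1, p2 in zip(parts1, parts2):
        seen = set(p2)                  # local, per-partition index
        out.extend(x for x in p1 if x not in seen)
    return out

rdd1 = hash_partition([("a", 1), ("b", 2), ("c", 3)])
rdd2 = hash_partition([("b", 2), ("c", 99)])
result = copartitioned_subtract(rdd1, rdd2)
# ("b", 2) is removed; ("c", 3) stays because the values differ
assert sorted(result) == [("a", 1), ("c", 3)]
```

Unlike the outer-join sketch above the quoted message, this removes an element only on an exact (key, value) match, matching the semantics of RDD.subtract described in the question.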
