spark-dev mailing list archives

From Liang-Chi Hsieh <vii...@gmail.com>
Subject Re: Equally split a RDD partition into two partition at the same node
Date Mon, 16 Jan 2017 08:00:43 GMT

Hi Fei,

I think it should work, but you may need to add some logic in compute() to
decide which half of the parent partition to output. You also need to return
the correct preferred locations for the two partitions that share the same
parent partition.
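
A minimal sketch of such an RDD (class and member names like SplitRDD are
illustrative, not Spark API; it assumes each parent partition fits in memory,
since compute() materializes it to find the midpoint, and it needs a live
SparkContext to actually run):

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition class: child partition i wraps parent partition i / 2;
// even-indexed children emit the first half, odd-indexed ones the second half.
private class SplitPartition(val index: Int, val parent: Partition)
    extends Partition {
  def firstHalf: Boolean = index % 2 == 0
}

class SplitRDD[T: scala.reflect.ClassTag](prev: RDD[T]) extends RDD[T](prev) {

  // Double the partitions: each parent partition yields two children.
  override def getPartitions: Array[Partition] =
    firstParent[T].partitions.flatMap { p =>
      Array(new SplitPartition(2 * p.index, p),
            new SplitPartition(2 * p.index + 1, p))
    }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val sp = split.asInstanceOf[SplitPartition]
    // Materialize the parent partition to locate the midpoint. Note this
    // recomputes the parent for each of the two children unless it is cached.
    val elems = firstParent[T].iterator(sp.parent, context).toArray
    val mid = elems.length / 2
    if (sp.firstHalf) elems.iterator.take(mid) else elems.iterator.drop(mid)
  }

  // Both children report the parent partition's preferred locations,
  // which keeps them on the same node as the parent.
  override def getPreferredLocations(split: Partition): Seq[String] =
    firstParent[T].preferredLocations(split.asInstanceOf[SplitPartition].parent)
}
```

Caching the parent RDD before constructing the target RDD would avoid
computing each parent partition twice.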


Fei Hu wrote
> Hi Liang-Chi,
> 
> Yes, you are right. I implemented the following solution for this problem,
> and it works, but I am not sure if it is efficient:
> 
> I doubled the partitions of the parent RDD, and then used the new partitions
> and the parent RDD to construct the target RDD. In the compute() function of
> the target RDD, I use the input partition to get the corresponding parent
> partition, and take half of the elements in the parent partition as the
> output of the computing function.
> 
> Thanks,
> Fei
> 
> On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh <viirya@> wrote:
> 
>>
>> Hi,
>>
>> When calling `coalesce` with `shuffle = false`, it produces at most
>> min(numPartitions, the previous RDD's number of partitions) partitions. So
>> I think it can't be used to double the number of partitions.
>>
>>
>> Anastasios Zouzias wrote
>> > Hi Fei,
>> >
>> > Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>> >
>> > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>> >
>> > coalesce is mostly used for reducing the number of partitions before
>> > writing to HDFS, but it might still be a narrow dependency (satisfying
>> > your requirements) if you increase the # of partitions.
>> >
>> > Best,
>> > Anastasios
>> >
>> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufei68@> wrote:
>> >
>> >> Dear all,
>> >>
>> >> I want to equally divide an RDD partition into two partitions. That
>> >> means the first half of the elements in the partition will form one new
>> >> partition, and the second half will form another new partition. But the
>> >> two new partitions are required to be on the same node as their parent
>> >> partition, which helps achieve high data locality.
>> >>
>> >> Is there anyone who knows how to implement it or any hints for it?
>> >>
>> >> Thanks in advance,
>> >> Fei
>> >>
>> >>
>> >
>> >
>> > --
>> > -- Anastasios Zouzias
>> > <azo@.ibm>
>>
>>
>>
>>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

