Date: Mon, 16 Jan 2017 01:00:43 -0700 (MST)
From: Liang-Chi Hsieh
To: dev@spark.apache.org
Subject: Re: Equally split a RDD partition into two partition at the same node

Hi Fei,

I think it should work, but you may need to add some logic in compute() to
decide which half of the parent partition to output, and you will need to
return the correct preferred locations for the two partitions that share the
same parent partition.
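For reference, a minimal, untested sketch of that idea against Spark's
developer API (RDD, Partition, NarrowDependency); the SplitRDD and
SplitPartition names are made up for illustration:

import scala.reflect.ClassTag

import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// One child partition per half of a parent partition.
case class SplitPartition(index: Int, parentIndex: Int, firstHalf: Boolean)
  extends Partition

// Doubles the number of partitions of `prev` through a narrow dependency,
// so each half can be scheduled on the same node as its parent partition.
class SplitRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, List(new NarrowDependency[T](prev) {
    // Child partition i depends only on parent partition i / 2.
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](prev.partitions.length * 2) { i =>
      SplitPartition(index = i, parentIndex = i / 2, firstHalf = i % 2 == 0)
    }

  // Report the parent partition's locations so both halves stay local to it.
  override def getPreferredLocations(split: Partition): Seq[String] = {
    val sp = split.asInstanceOf[SplitPartition]
    prev.preferredLocations(prev.partitions(sp.parentIndex))
  }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val sp = split.asInstanceOf[SplitPartition]
    // Materialize the parent partition to find its midpoint, then emit one half.
    val parent = prev.iterator(prev.partitions(sp.parentIndex), context).toArray
    val mid = parent.length / 2
    if (sp.firstHalf) parent.iterator.take(mid) else parent.iterator.drop(mid)
  }
}

Note that compute() above materializes the whole parent partition to find its
midpoint, and each half recomputes the parent partition in its own task unless
the parent RDD is cached, so persisting the parent first is probably worthwhile.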
Fei Hu wrote
> Hi Liang-Chi,
>
> Yes, you are right. I implemented the following solution for this problem,
> and it works, but I am not sure whether it is efficient:
>
> I double the partitions of the parent RDD, and then use the new partitions
> and the parent RDD to construct the target RDD. In the compute() function
> of the target RDD, I use the input partition to find the corresponding
> parent partition, and return half of the elements in the parent partition
> as the output of the computing function.
>
> Thanks,
> Fei
>
> On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh <viirya@> wrote:
>
>> Hi,
>>
>> When calling `coalesce` with `shuffle = false`, it produces at most
>> min(numPartitions, the previous RDD's number of partitions) partitions,
>> so it can't be used to double the number of partitions.
>>
>> Anastasios Zouzias wrote
>> > Hi Fei,
>> >
>> > Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>> >
>> > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>> >
>> > coalesce is mostly used for reducing the number of partitions before
>> > writing to HDFS, but it might still be a narrow dependency (satisfying
>> > your requirements) if you increase the # of partitions.
>> >
>> > Best,
>> > Anastasios
>> >
>> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufei68@> wrote:
>> >
>> >> Dear all,
>> >>
>> >> I want to equally divide an RDD partition into two partitions. That
>> >> means the first half of the elements in the partition will form a new
>> >> partition, and the second half will form another new partition. The
>> >> two new partitions are required to be on the same node as their parent
>> >> partition, which helps achieve high data locality.
>> >>
>> >> Is there anyone who knows how to implement this, or any hints for it?
>> >>
>> >> Thanks in advance,
>> >> Fei
>> >
>> > --
>> > -- Anastasios Zouzias
>>
>> -----
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
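As an aside on the coalesce() behaviour discussed in the quoted thread above,
it is easy to confirm in spark-shell (where sc is the SparkContext; the
partition counts here are only illustrative):

val rdd = sc.parallelize(1 to 100, 4)

// A narrow coalesce can only merge partitions; asking for more leaves the
// count unchanged.
rdd.coalesce(8, shuffle = false).getNumPartitions   // still 4

// With shuffle = true (equivalently, repartition(8)), the count does grow,
// at the cost of a full shuffle.
rdd.coalesce(8, shuffle = true).getNumPartitions    // 8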