Date: Mon, 16 Jan 2017 01:00:43 -0700 (MST)
From: Liang-Chi Hsieh
To: dev@spark.apache.org
Subject: Re: Equally split a RDD partition into two partition at the same node

Hi Fei,

I think it should work, but you may need to add some logic in compute() to
decide which half of the parent partition to output, and you will need to
return the correct preferred locations for the two partitions that share the
same parent partition.
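For reference, a minimal, untested sketch of that idea against Spark's
developer API (RDD, Partition, NarrowDependency); the SplitRDD and
SplitPartition names are made up for illustration:

import scala.reflect.ClassTag

import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// One child partition per half of a parent partition.
case class SplitPartition(index: Int, parentIndex: Int, firstHalf: Boolean)
  extends Partition

// Doubles the number of partitions of `prev` through a narrow dependency,
// so each half can be scheduled on the same node as its parent partition.
class SplitRDD[T: ClassTag](prev: RDD[T])
  extends RDD[T](prev.context, List(new NarrowDependency[T](prev) {
    // Child partition i depends only on parent partition i / 2.
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
  })) {

  override def getPartitions: Array[Partition] =
    Array.tabulate[Partition](prev.partitions.length * 2) { i =>
      SplitPartition(index = i, parentIndex = i / 2, firstHalf = i % 2 == 0)
    }

  // Report the parent partition's locations so both halves stay local to it.
  override def getPreferredLocations(split: Partition): Seq[String] = {
    val sp = split.asInstanceOf[SplitPartition]
    prev.preferredLocations(prev.partitions(sp.parentIndex))
  }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val sp = split.asInstanceOf[SplitPartition]
    // Materialize the parent partition to find its midpoint, then emit one half.
    val parent = prev.iterator(prev.partitions(sp.parentIndex), context).toArray
    val mid = parent.length / 2
    if (sp.firstHalf) parent.iterator.take(mid) else parent.iterator.drop(mid)
  }
}

Note that compute() above materializes the whole parent partition to find its
midpoint, and each half recomputes the parent partition in its own task unless
the parent RDD is cached, so persisting the parent first is probably worthwhile.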
Fei Hu wrote
> Hi Liang-Chi,
>
> Yes, you are right. I implemented the following solution for this problem,
> and it works, but I am not sure whether it is efficient:
>
> I double the partitions of the parent RDD, and then use the new partitions
> and the parent RDD to construct the target RDD. In the compute() function
> of the target RDD, I use the input partition to find the corresponding
> parent partition, and return half of the elements in the parent partition
> as the output of the computing function.
>
> Thanks,
> Fei
>
> On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh <viirya@> wrote:
>
>> Hi,
>>
>> When calling `coalesce` with `shuffle = false`, it produces at most
>> min(numPartitions, the previous RDD's number of partitions) partitions,
>> so it can't be used to double the number of partitions.
>>
>> Anastasios Zouzias wrote
>> > Hi Fei,
>> >
>> > Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?
>> >
>> > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>> >
>> > coalesce is mostly used for reducing the number of partitions before
>> > writing to HDFS, but it might still be a narrow dependency (satisfying
>> > your requirements) if you increase the # of partitions.
>> >
>> > Best,
>> > Anastasios
>> >
>> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <hufei68@> wrote:
>> >
>> >> Dear all,
>> >>
>> >> I want to equally divide an RDD partition into two partitions. That
>> >> means the first half of the elements in the partition will form a new
>> >> partition, and the second half will form another new partition. The
>> >> two new partitions are required to be on the same node as their parent
>> >> partition, which helps achieve high data locality.
>> >>
>> >> Is there anyone who knows how to implement this, or any hints for it?
>> >>
>> >> Thanks in advance,
>> >> Fei
>> >
>> > --
>> > -- Anastasios Zouzias
>>
>> -----
>> Liang-Chi Hsieh | @viirya
>> Spark Technology Center
>> http://www.spark.tc/

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
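As an aside on the coalesce() behaviour discussed in the quoted thread above,
it is easy to confirm in spark-shell (where sc is the SparkContext; the
partition counts here are only illustrative):

val rdd = sc.parallelize(1 to 100, 4)

// A narrow coalesce can only merge partitions; asking for more leaves the
// count unchanged.
rdd.coalesce(8, shuffle = false).getNumPartitions   // still 4

// With shuffle = true (equivalently, repartition(8)), the count does grow,
// at the cost of a full shuffle.
rdd.coalesce(8, shuffle = true).getNumPartitions    // 8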