From: rxin@apache.org
To: commits@spark.apache.org
Reply-To: dev@spark.apache.org
Subject: git commit: default task number misleading in several places
Date: Thu, 15 May 2014 01:20:30 +0000 (UTC)

Repository: spark
Updated Branches:
  refs/heads/master 44165fc91 -> 2f639957f

default task number misleading in several places

    private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
      new HashPartitioner(numPartitions)
    }

This shows that the default task number in Spark Streaming relies on the defaultParallelism value of its SparkContext, which in turn is decided by the config property spark.default.parallelism; for background on that property, see https://github.com/apache/spark/pull/389.

Author: Chen Chao

Closes #766 from CrazyJvm/patch-7 and squashes the following commits:

0b7efba [Chen Chao] Update streaming-programming-guide.md
cc5b66c [Chen Chao] default task number misleading in several places
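For context (an editorial sketch, not part of this commit): a minimal standalone Scala program showing how spark.default.parallelism feeds SparkContext.defaultParallelism, the value the defaultPartitioner above falls back to. The object name, the local[2] master, and the property value 4 are illustrative assumptions.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object DefaultParallelismSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative settings: explicitly set the property this patch
        // documents. If it is left unset, the fallback differs by
        // deployment mode.
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("default-parallelism-sketch")
          .set("spark.default.parallelism", "4")
        val sc = new SparkContext(conf)

        // sc.defaultParallelism now reflects spark.default.parallelism; this
        // is the partition count the streaming defaultPartitioner uses when
        // no numTasks argument is given to operations such as reduceByKey.
        val partitioner = new HashPartitioner(sc.defaultParallelism)
        println(s"defaultParallelism = ${sc.defaultParallelism}, " +
          s"partitions = ${partitioner.numPartitions}")

        sc.stop()
      }
    }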
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2f639957
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2f639957
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2f639957

Branch: refs/heads/master
Commit: 2f639957f0bf70dddf1e698aa9e26007fb58bc67
Parents: 44165fc
Author: Chen Chao
Authored: Wed May 14 18:20:20 2014 -0700
Committer: Reynold Xin
Committed: Wed May 14 18:20:20 2014 -0700

----------------------------------------------------------------------
 docs/streaming-programming-guide.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/2f639957/docs/streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 939599a..0c125eb 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -522,9 +522,9 @@ common ones are as follows.
   reduceByKey(func, [numTasks])
   When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the
   values for each key are aggregated using the given reduce function. Note: By default,
-  this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to
-  do the grouping. You can pass an optional numTasks argument to set a different
-  number of tasks.
+  this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the
+  number is determined by the config property spark.default.parallelism) to do the grouping.
+  You can pass an optional numTasks argument to set a different number of tasks.

   join(otherStream, [numTasks])

@@ -743,8 +743,9 @@ said two parameters - windowLength and slideInterval.
   When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the
   values for each key are aggregated using the given reduce function func over batches in a
   sliding window. Note: By default, this uses Spark's default number of
-  parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional
-  numTasks argument to set a different number of tasks.
+  parallel tasks (2 for local mode, and in cluster mode the number is determined by the config
+  property spark.default.parallelism) to do the grouping. You can pass an optional
+  numTasks argument to set a different number of tasks.

@@ -956,9 +957,10 @@ before further processing.
 ### Level of Parallelism in Data Processing
 Cluster resources may be under-utilized if the number of parallel tasks used in any stage of the
 computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
-and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
-parallelism as an argument (see the
-[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the
+[config property](configuration.html#spark-properties) `spark.default.parallelism`. You can pass
+the level of parallelism as an argument (see
+[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
 documentation), or set the [config property](configuration.html#spark-properties)
 `spark.default.parallelism` to change the default.
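For illustration (an editorial sketch, not part of this commit): a minimal streaming job exercising the two mechanisms the revised guide describes for raising reduce-side parallelism. The socket source on localhost:9999, the 1-second batch interval, and the values 8 and 16 are assumptions for the example.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingParallelismSketch {
      def main(args: Array[String]): Unit = {
        // Option 1: change the default for all shuffles in the job via the
        // config property.
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("streaming-parallelism-sketch")
          .set("spark.default.parallelism", "8")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Assumed source for the example; substitute any real DStream.
        val pairs = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map(word => (word, 1))

        // Option 2: override the default for a single operation by passing
        // the optional numTasks argument (16 here).
        val counts = pairs.reduceByKey(_ + _, 16)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

An explicit numTasks argument takes precedence for that one operation; spark.default.parallelism governs any operation where the argument is omitted.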