From: rxin@apache.org
To: commits@spark.apache.org
Reply-To: dev@spark.apache.org
Subject: git commit: default task number misleading in several places
Date: Thu, 15 May 2014 01:20:30 +0000 (UTC)

Repository: spark
Updated Branches:
  refs/heads/master 44165fc91 -> 2f639957f

default task number misleading in several places

    private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
      new HashPartitioner(numPartitions)
    }

This shows that the default task number in Spark Streaming relies on the defaultParallelism value of its SparkContext, which in turn is decided by the config property spark.default.parallelism; for background on that property, see https://github.com/apache/spark/pull/389.

Author: Chen Chao

Closes #766 from CrazyJvm/patch-7 and squashes the following commits:

0b7efba [Chen Chao] Update streaming-programming-guide.md
cc5b66c [Chen Chao] default task number misleading in several places
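For context (an editorial sketch, not part of this commit): a minimal standalone Scala program showing how spark.default.parallelism feeds SparkContext.defaultParallelism, the value the defaultPartitioner above falls back to. The object name, the local[2] master, and the property value 4 are illustrative assumptions.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object DefaultParallelismSketch {
      def main(args: Array[String]): Unit = {
        // Illustrative settings: explicitly set the property this patch
        // documents. If it is left unset, the fallback differs by
        // deployment mode.
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("default-parallelism-sketch")
          .set("spark.default.parallelism", "4")
        val sc = new SparkContext(conf)

        // sc.defaultParallelism now reflects spark.default.parallelism; this
        // is the partition count the streaming defaultPartitioner uses when
        // no numTasks argument is given to operations such as reduceByKey.
        val partitioner = new HashPartitioner(sc.defaultParallelism)
        println(s"defaultParallelism = ${sc.defaultParallelism}, " +
          s"partitions = ${partitioner.numPartitions}")

        sc.stop()
      }
    }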
Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2f639957
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2f639957
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2f639957

Branch: refs/heads/master
Commit: 2f639957f0bf70dddf1e698aa9e26007fb58bc67
Parents: 44165fc
Author: Chen Chao
Authored: Wed May 14 18:20:20 2014 -0700
Committer: Reynold Xin
Committed: Wed May 14 18:20:20 2014 -0700

----------------------------------------------------------------------
 docs/streaming-programming-guide.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/2f639957/docs/streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 939599a..0c125eb 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -522,9 +522,9 @@ common ones are as follows.
   reduceByKey(func, [numTasks])
   When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the
   values for each key are aggregated using the given reduce function. Note: By default,
-  this uses Spark's default number of parallel tasks (2 for local machine, 8 for a cluster) to
-  do the grouping. You can pass an optional numTasks argument to set a different
-  number of tasks.
+  this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the
+  number is determined by the config property spark.default.parallelism) to do the grouping.
+  You can pass an optional numTasks argument to set a different number of tasks.

   join(otherStream, [numTasks])

@@ -743,8 +743,9 @@ said two parameters - windowLength and slideInterval.
   When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the
   values for each key are aggregated using the given reduce function func over batches in a
   sliding window. Note: By default, this uses Spark's default number of
-  parallel tasks (2 for local machine, 8 for a cluster) to do the grouping. You can pass an optional
-  numTasks argument to set a different number of tasks.
+  parallel tasks (2 for local mode, and in cluster mode the number is determined by the config
+  property spark.default.parallelism) to do the grouping. You can pass an optional
+  numTasks argument to set a different number of tasks.

@@ -956,9 +957,10 @@ before further processing.
 ### Level of Parallelism in Data Processing
 Cluster resources may be under-utilized if the number of parallel tasks used in any stage of the
 computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
-and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
-parallelism as an argument (see the
-[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+and `reduceByKeyAndWindow`, the default number of parallel tasks is decided by the
+[config property](configuration.html#spark-properties) `spark.default.parallelism`. You can pass
+the level of parallelism as an argument (see
+[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
 documentation), or set the [config property](configuration.html#spark-properties)
 `spark.default.parallelism` to change the default.
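For illustration (an editorial sketch, not part of this commit): a minimal streaming job exercising the two mechanisms the revised guide describes for raising reduce-side parallelism. The socket source on localhost:9999, the 1-second batch interval, and the values 8 and 16 are assumptions for the example.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingParallelismSketch {
      def main(args: Array[String]): Unit = {
        // Option 1: change the default for all shuffles in the job via the
        // config property.
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("streaming-parallelism-sketch")
          .set("spark.default.parallelism", "8")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Assumed source for the example; substitute any real DStream.
        val pairs = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split(" "))
          .map(word => (word, 1))

        // Option 2: override the default for a single operation by passing
        // the optional numTasks argument (16 here).
        val counts = pairs.reduceByKey(_ + _, 16)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }

An explicit numTasks argument takes precedence for that one operation; spark.default.parallelism governs any operation where the argument is omitted.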