spark-reviews mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
Date Tue, 10 Mar 2020 08:42:23 GMT
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390161407
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that uses runtime statistics to choose the most efficient query execution plan. AQE is disabled by default; the umbrella configuration `spark.sql.adaptive.enabled` controls whether it is turned on or off. As of Spark 3.0, AQE has three major features: coalescing post-shuffle partitions, local shuffle reader optimization, and skewed join optimization.
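AQE can be turned on per session. A minimal sketch in Spark SQL, using only the configuration key named above (the same key can also be passed via `spark-submit --conf` or `spark.conf.set`):

    SET spark.sql.adaptive.enabled = true;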
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post-shuffle partitions based on map output statistics when the `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. The following four sub-configurations govern this optimization rule. This feature simplifies the tuning of the shuffle partition number when running queries: you do not need to set a shuffle partition number that fits your dataset; you only need to set a large enough number, and Spark can pick the proper shuffle partition number at runtime. A configuration sketch follows the table below.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, Spark will reduce the post-shuffle partition number based on map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to set this to roughly 2~3x the parallelism when benchmarking.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution, which serves as the initial shuffle partition number before coalescing. By default it equals <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
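As a hedged sketch of how these properties combine, the following Spark SQL session sets an intentionally large initial partition number and lets the coalescing rule shrink it at runtime; the concrete values are illustrative, not recommendations:

    -- enable AQE and the partition coalescing rule
    SET spark.sql.adaptive.enabled = true;
    SET spark.sql.adaptive.coalescePartitions.enabled = true;
    -- start high; Spark coalesces down at runtime using map output statistics
    SET spark.sql.adaptive.coalescePartitions.initialPartitionNum = 1000;
    -- target roughly 64 MB of input per post-shuffle task
    SET spark.sql.adaptive.advisoryPartitionSizeInBytes = 67108864;

The runtime partition count will not drop below `spark.sql.adaptive.coalescePartitions.minPartitionNum`.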
+ ### Optimize Local Shuffle Reader
+ This feature optimizes the shuffle reader to a local shuffle reader when a sort merge join is converted to a broadcast hash join at runtime and no additional shuffle is introduced. It takes effect when the `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve performance by saving the network overhead of the shuffle process.
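Enabling it is a matter of the two configuration keys above; a minimal sketch in Spark SQL:

    SET spark.sql.adaptive.enabled = true;
    SET spark.sql.adaptive.localShuffleReader.enabled = true;

No query change is required: the local reader is substituted when the adaptive planner rewrites a sort merge join into a broadcast hash join without introducing an extra shuffle.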
+ ### Optimize Skewed Join
+ This feature detects skewed partitions and creates multiple tasks to handle them when enabling both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. The following two sub-configurations apply to this optimization rule. Data skew can severely degrade the performance of join queries; this feature splits a skewed partition into multiple parallel tasks instead of the original single task to reduce the overhead of a skewed join.
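A minimal sketch of enabling the rule in Spark SQL, using only the two keys named above (the two sub-configurations mentioned in the text are not shown in this excerpt):

    SET spark.sql.adaptive.enabled = true;
    SET spark.sql.adaptive.skewJoin.enabled = true;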
 
 Review comment:
   when enabling both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`.
