Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 5ADED200B13 for ; Wed, 1 Jun 2016 02:30:08 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 594D4160A46; Wed, 1 Jun 2016 00:30:08 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 9F5E9160A44 for ; Wed, 1 Jun 2016 02:30:07 +0200 (CEST) Received: (qmail 21237 invoked by uid 500); 1 Jun 2016 00:30:06 -0000 Mailing-List: contact commits-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list commits@spark.apache.org Received: (qmail 21228 invoked by uid 99); 1 Jun 2016 00:30:06 -0000 Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 01 Jun 2016 00:30:06 +0000 Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33) id 7D35FDFB74; Wed, 1 Jun 2016 00:30:06 +0000 (UTC) Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: rxin@apache.org To: commits@spark.apache.org Message-Id: <7ca205fe48844c8e8e5aa516f3cbfc0a@git.apache.org> X-Mailer: ASF-Git Admin Mailer Subject: spark git commit: [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues Date: Wed, 1 Jun 2016 00:30:06 +0000 (UTC) archived-at: Wed, 01 Jun 2016 00:30:08 -0000 Repository: spark Updated Branches: refs/heads/master 223f1d58c -> 8ca01a6fe [SPARK-15680][SQL] Disable comments in generated code in order to avoid perf. issues ## What changes were proposed in this pull request? In benchmarks involving tables with very wide and complex schemas (thousands of columns, deep nesting), I noticed that significant amounts of time (order of tens of seconds per task) were being spent generating comments during the code generation phase. The root cause of the performance problem stems from the fact that calling toString() on a complex expression can involve thousands of string concatenations, resulting in huge amounts (tens of gigabytes) of character array allocation and copying. In the long term, we can avoid this problem by passing StringBuilders down the tree and using them to accumulate output. As a short-term workaround, this patch guards comment generation behind a flag and disables comments by default (for wide tables / complex queries, these comments were being truncated prior to display and thus were not very useful). ## How was this patch tested? This was tested manually by running a Spark SQL query over an empty table with a very wide schema obtained from a real workload. Disabling comments brought the per-task time down from about 16 seconds to 600 milliseconds. Author: Josh Rosen Closes #13421 from JoshRosen/disable-line-comments-in-codegen. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8ca01a6f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8ca01a6f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8ca01a6f Branch: refs/heads/master Commit: 8ca01a6feb4935b1a3815cfbff1b90ccc6f60984 Parents: 223f1d5 Author: Josh Rosen Authored: Tue May 31 17:30:03 2016 -0700 Committer: Reynold Xin Committed: Tue May 31 17:30:03 2016 -0700 ---------------------------------------------------------------------- .../expressions/codegen/CodeGenerator.scala | 23 ++++++++++++++------ 1 file changed, 16 insertions(+), 7 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/8ca01a6f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---------------------------------------------------------------------- diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala index 93e477e..9657f26 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala @@ -24,6 +24,7 @@ import com.google.common.cache.{CacheBuilder, CacheLoader} import org.codehaus.janino.ClassBodyEvaluator import scala.language.existentials +import org.apache.spark.SparkEnv import org.apache.spark.internal.Logging import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions._ @@ -724,15 +725,23 @@ class CodegenContext { /** * Register a comment and return the corresponding place holder */ - def registerComment(text: String): String = { - val name = freshName("c") - val comment = if (text.contains("\n") || text.contains("\r")) { - text.split("(\r\n)|\r|\n").mkString("/**\n * ", "\n * ", "\n */") + def registerComment(text: => String): String = { + // By default, disable comments in generated code because computing the comments themselves can + // be extremely expensive in certain cases, such as deeply-nested expressions which operate over + // inputs with wide schemas. For more details on the performance issues that motivated this + // flat, see SPARK-15680. + if (SparkEnv.get != null && SparkEnv.get.conf.getBoolean("spark.sql.codegen.comments", false)) { + val name = freshName("c") + val comment = if (text.contains("\n") || text.contains("\r")) { + text.split("(\r\n)|\r|\n").mkString("/**\n * ", "\n * ", "\n */") + } else { + s"// $text" + } + placeHolderToComments += (name -> comment) + s"/*$name*/" } else { - s"// $text" + "" } - placeHolderToComments += (name -> comment) - s"/*$name*/" } } --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org For additional commands, e-mail: commits-help@spark.apache.org