Return-Path: X-Original-To: apmail-spark-issues-archive@minotaur.apache.org Delivered-To: apmail-spark-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id A04671812D for ; Mon, 18 Apr 2016 10:21:30 +0000 (UTC) Received: (qmail 16131 invoked by uid 500); 18 Apr 2016 10:21:25 -0000 Delivered-To: apmail-spark-issues-archive@spark.apache.org Received: (qmail 16099 invoked by uid 500); 18 Apr 2016 10:21:25 -0000 Mailing-List: contact issues-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list issues@spark.apache.org Received: (qmail 16076 invoked by uid 99); 18 Apr 2016 10:21:25 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 18 Apr 2016 10:21:25 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id 6F98E2C14F4 for ; Mon, 18 Apr 2016 10:21:25 +0000 (UTC) Date: Mon, 18 Apr 2016 10:21:25 +0000 (UTC) From: "Liang-Chi Hsieh (JIRA)" To: issues@spark.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (SPARK-14083) Analyze JVM bytecode and turn closures into Catalyst expressions MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15245427#comment-15245427 ] Liang-Chi Hsieh commented on SPARK-14083: ----------------------------------------- Based on [~joshrosen]'s code, I added some comments and few Java opcodes: https://github.com/apache/spark/compare/master...viirya:expression-analysis2 > Analyze JVM bytecode and turn closures into Catalyst expressions > ---------------------------------------------------------------- > > Key: SPARK-14083 > URL: https://issues.apache.org/jira/browse/SPARK-14083 > Project: Spark > Issue Type: New Feature > Components: SQL > Reporter: Reynold Xin > > One big advantage of the Dataset API is the type safety, at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions. > Some examples are: > {code} > df.map(_.name) // equivalent to expression col("name") > ds.groupBy(_.gender) // equivalent to expression col("gender") > df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), lit(18) > df.map(_.id + 1) // equivalent to Add(col("age"), lit(1)) > {code} > The goal of this ticket is to design a small framework for byte code analysis and use that to convert closures/lambdas into Catalyst expressions in order to speed up Dataset execution. It is a little bit futuristic, but I believe it is very doable. The framework should be easy to reason about (e.g. similar to Catalyst). > Note that a big emphasis on "small" and "easy to reason about". A patch should be rejected if it is too complicated or difficult to reason about. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org For additional commands, e-mail: issues-help@spark.apache.org