Date: Wed, 1 Jun 2016 17:12:59 +0000 (UTC)
From: "Kazuaki Ishizaki (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-15687) Columnar execution engine

    [ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310663#comment-15310663 ]

Kazuaki Ishizaki commented on SPARK-15687:
------------------------------------------

Thank you for creating this interesting JIRA entry.
Based on my experiments (SPARK-13805, SPARK-14098, SPARK-15117, and SPARK-15380) to enable columnar storage in whole-stage codegen, I have some (implementation-perspective?) questions:
* How do we pass columnar data between operators? Currently, we use {{Iterator[Row]}} to pass data between operators.
* Who decides which data format (columnar or row-oriented) to use? The logical planner, the physical planner, or something else?
* Will we use the Apache Arrow format as an internal format?
* We have two internal columnar formats: {{ColumnarBatch}} and {{CachedBatch}}. Will we integrate these two into one?

> Columnar execution engine
> -------------------------
>
>                 Key: SPARK-15687
>                 URL: https://issues.apache.org/jira/browse/SPARK-15687
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types.
> The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective:
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can be externalized and use that in data source API v2?
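
To make the first question in the comment concrete, here is a minimal sketch of the two exchange styles: operators that pass an iterator of rows versus operators that pass an iterator of column batches, plus the adapters a planner would need at a row/columnar boundary. This is not Spark code. The `ColumnBatch` class, the `id`/`value` columns, and every function name below are hypothetical and exist only for illustration; Spark's actual {{ColumnarBatch}} and {{CachedBatch}} have different interfaces.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# Hypothetical two-column row: (id, value).
Row = Tuple[int, float]


@dataclass
class ColumnBatch:
    """Hypothetical column batch: one list per column, all of equal length."""
    columns: Dict[str, List]

    @property
    def num_rows(self) -> int:
        return len(next(iter(self.columns.values()))) if self.columns else 0


def filter_rows(rows: Iterator[Row], threshold: float) -> Iterator[Row]:
    """Row-at-a-time operator: consumes and produces an iterator of rows."""
    for row in rows:
        if row[1] > threshold:
            yield row


def filter_batches(batches: Iterator[ColumnBatch], threshold: float) -> Iterator[ColumnBatch]:
    """Batch-at-a-time operator: consumes and produces an iterator of column batches."""
    for batch in batches:
        # Evaluate the predicate over one column, then gather the surviving
        # positions from every column.
        keep = [i for i, v in enumerate(batch.columns["value"]) if v > threshold]
        yield ColumnBatch({name: [col[i] for i in keep]
                           for name, col in batch.columns.items()})


def rows_to_batches(rows: Iterator[Row], batch_size: int) -> Iterator[ColumnBatch]:
    """Adapter at a row-to-columnar boundary: groups rows into column batches."""
    ids, values = [], []
    for row_id, value in rows:
        ids.append(row_id)
        values.append(value)
        if len(ids) == batch_size:
            yield ColumnBatch({"id": ids, "value": values})
            ids, values = [], []
    if ids:  # emit the final, possibly smaller, batch
        yield ColumnBatch({"id": ids, "value": values})


def batches_to_rows(batches: Iterator[ColumnBatch]) -> Iterator[Row]:
    """Adapter at a columnar-to-row boundary: flattens batches back into rows."""
    for batch in batches:
        yield from zip(batch.columns["id"], batch.columns["value"])
```

In this sketch both operator styles keep the same pull-based iterator shape and only the element type changes, which is one possible answer to the "how do we pass columnar data" question. Which component inserts the `rows_to_batches` and `batches_to_rows` adapters corresponds to the "who decides" question, and the sketch deliberately leaves that open.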