Mailing-List: contact issues-help@spark.apache.org; run by ezmlm
Precedence: bulk
Date: Wed, 1 Jun 2016 17:12:59 +0000 (UTC)
From: "Kazuaki Ishizaki (JIRA)" <jira@apache.org>
To: issues@spark.apache.org
Message-ID: <JIRA.12974439.1464758693000.4196.1464801179348@Atlassian.JIRA>
In-Reply-To: <JIRA.12974439.1464758693000@Atlassian.JIRA>
References: <JIRA.12974439.1464758693000@Atlassian.JIRA> <JIRA.12974439.1464758693685@arcas>
Subject: [jira] [Commented] (SPARK-15687) Columnar execution engine
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Wed, 01 Jun 2016 17:13:01 -0000


    [ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310663#comment-15310663 ] 

Kazuaki Ishizaki commented on SPARK-15687:
------------------------------------------

Thank you for creating this interesting JIRA entry. Based on my experiments (SPARK-13805, SPARK-14098, SPARK-15117, and SPARK-15380) to enable columnar storage in whole-stage codegen, I have some questions, mainly from an implementation perspective:
* How do we pass columnar data between operators? Currently, we use {{Iterator[Row]}} to pass data between operators.
* Who decides which data format (columnar or row-oriented) to use: the logical planner, the physical planner, or something else?
* Will we use the Apache Arrow format as the internal format?
* We have two internal columnar formats: {{ColumnarBatch}} and {{CachedBatch}}. Will we integrate these two into one?
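
To make the first question concrete, here is a minimal sketch of the two exchange styles, written in plain Python rather than against Spark's APIs. Everything in it is an assumption for illustration: the {{ColumnBatch}} class, the (id, price) schema, and the function names are hypothetical and are not Spark's {{ColumnarBatch}} or {{CachedBatch}}. It only shows that the same filter can consume a row iterator or an iterator of column batches, with converters at the boundary between the two.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# Hypothetical schema used only for this sketch: (id, price).
Row = Tuple[int, float]


@dataclass
class ColumnBatch:
    """Hypothetical column batch: one list per column, all of equal length."""
    columns: Dict[str, List]


def filter_rows(rows: Iterator[Row], min_price: float) -> Iterator[Row]:
    """Row-oriented operator: consumes and produces one row at a time."""
    for row in rows:
        if row[1] >= min_price:
            yield row


def filter_batches(batches: Iterator[ColumnBatch], min_price: float) -> Iterator[ColumnBatch]:
    """Columnar operator: scans one column, then gathers the surviving positions."""
    for batch in batches:
        keep = [i for i, p in enumerate(batch.columns["price"]) if p >= min_price]
        yield ColumnBatch({name: [col[i] for i in keep]
                           for name, col in batch.columns.items()})


def rows_to_batches(rows: Iterator[Row], batch_size: int) -> Iterator[ColumnBatch]:
    """Boundary converter: pivots a stream of rows into fixed-size column batches."""
    ids: List[int] = []
    prices: List[float] = []
    for id_, price in rows:
        ids.append(id_)
        prices.append(price)
        if len(ids) == batch_size:
            yield ColumnBatch({"id": ids, "price": prices})
            ids, prices = [], []
    if ids:  # emit the final, possibly smaller batch
        yield ColumnBatch({"id": ids, "price": prices})


def batches_to_rows(batches: Iterator[ColumnBatch]) -> Iterator[Row]:
    """Boundary converter: pivots column batches back into rows."""
    for batch in batches:
        yield from zip(batch.columns["id"], batch.columns["price"])
```

In this sketch the planner-level question becomes where {{rows_to_batches}} and {{batches_to_rows}} get inserted, since each conversion copies data and a plan that alternates formats pays that cost repeatedly.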

> Columnar execution engine
> -------------------------
>
>                 Key: SPARK-15687
>                 URL: https://issues.apache.org/jira/browse/SPARK-15687
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially in the context of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet reading (via a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are already using whole-stage code generation, which is in many ways more efficient than a columnar execution engine for flat data types.
> The goal here is to figure out a story to work towards making column batch the common data exchange format between operators outside whole-stage code generation, as well as with external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective: 
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and how do we handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can be externalized and use that in data source API v2?

