spark-issues mailing list archives

From "Kazuaki Ishizaki (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-15687) Columnar execution engine
Date Mon, 06 Jun 2016 17:49:21 GMT

    [ https://issues.apache.org/jira/browse/SPARK-15687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316858#comment-15316858 ]

Kazuaki Ishizaki commented on SPARK-15687:
------------------------------------------

Thank you for your answers.

* How do we pass the columnar format among operators? Currently, we use {{Iterator[Row]}} to pass data
between operators.
** We can change it to pass {{Iterator[ColumnarBatch]}}.
*** In my WIP (SPARK-15380), I introduced [a new iterator|https://github.com/apache/spark/pull/13171/files#diff-28cb12941b992ff680c277c651b59aa0R445].
However, this iterator is not yet used as an iterator for columnar storage, as seen [here|https://github.com/apache/spark/pull/13171/files#diff-e4d7c2fc195fa8c801145928115cdcd0R178].
I need more information than a vanilla iterator provides for effective access to a columnar
storage. I would like to investigate what information we actually need through additional implementations
on columnar storage; a sketch of the idea follows below.
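
To make that concrete, here is a minimal sketch of an iterator that carries extra information beyond a plain {{Iterator[ColumnarBatch]}}. The trait name {{ColumnBatchIterator}} and its members are hypothetical illustrations, not the actual SPARK-15380 code:

{code:scala}
// Hypothetical sketch -- not the actual SPARK-15380 code.
import org.apache.spark.sql.types.StructType

// An iterator over column batches that also exposes extra
// information an operator may need for efficient columnar access.
trait ColumnBatchIterator[T] extends Iterator[T] {
  // Schema of the columns in each batch, so a consumer can bind
  // column accessors once instead of once per batch.
  def schema: StructType

  // Upper bound on rows per batch, useful for pre-allocating
  // output buffers in the consuming operator.
  def maxRowsPerBatch: Int
}
{code}

Which members are actually needed is exactly what the additional implementations on columnar storage should reveal.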

* Who decides which data format (columnar or row-oriented) to use? The logical planner, the physical planner,
or something else?
** It's a good question. I'd think this is just something the physical layer should be responsible
for, since it is about physical layout.
*** Good to hear your thoughts. We will see how we can make this decision in the {{PhysicalPlan}}; one possible shape is sketched below.
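
As one possible shape for such a decision, a physical-planning rule could insert a conversion node wherever the formats of a parent and a child disagree. This is a self-contained sketch with invented names ({{Format}}, {{ensureFormat}}, and the conversion nodes), not existing Spark code:

{code:scala}
// Hypothetical sketch: the physical layer picks the exchange format
// per plan edge and inserts a conversion where formats disagree.
object ChooseFormat {
  sealed trait Format
  case object RowFormat extends Format
  case object ColumnarFormat extends Format

  trait PhysicalPlan {
    def outputFormat: Format // the format this node produces
  }

  case class ColumnarToRow(child: PhysicalPlan) extends PhysicalPlan {
    def outputFormat: Format = RowFormat
  }
  case class RowToColumnar(child: PhysicalPlan) extends PhysicalPlan {
    def outputFormat: Format = ColumnarFormat
  }

  // Wrap `child` in a conversion node if its output format does not
  // match what the parent operator wants to consume.
  def ensureFormat(child: PhysicalPlan, wanted: Format): PhysicalPlan =
    (child.outputFormat, wanted) match {
      case (ColumnarFormat, RowFormat) => ColumnarToRow(child)
      case (RowFormat, ColumnarFormat) => RowToColumnar(child)
      case _                           => child
    }
}
{code}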

* Will we use the Apache Arrow format as our internal format?
** Arrow seems too early right now, and it's unlikely we'd want to make our internal format depend
on an external project. We can, however, make the format close to it so it would be easy to
integrate.
*** I see. In my experience, using {{ColumnVector}} can abstract the physical layout
of a columnar storage. We can use our internal format, and also use the Arrow format if data is
read from an external Arrow storage; the sketch below illustrates the abstraction.
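
Here is a self-contained sketch of that abstraction, with invented names rather than Spark's actual {{ColumnVector}} hierarchy; the same read path works over two different physical layouts:

{code:scala}
import java.nio.ByteBuffer

object ColumnVectorSketch {
  // Hypothetical abstraction that hides the physical layout.
  abstract class ColumnVector {
    def getInt(rowId: Int): Int
  }

  // Internal format: values live in an on-heap array.
  class OnHeapIntVector(values: Array[Int]) extends ColumnVector {
    def getInt(rowId: Int): Int = values(rowId)
  }

  // External format (e.g. Arrow-like): values live in a byte
  // buffer, read at a fixed 4-byte stride.
  class BufferIntVector(buf: ByteBuffer) extends ColumnVector {
    def getInt(rowId: Int): Int = buf.getInt(rowId * 4)
  }

  // An operator written against ColumnVector never sees the layout.
  def sum(col: ColumnVector, numRows: Int): Long = {
    var total = 0L
    var i = 0
    while (i < numRows) { total += col.getInt(i); i += 1 }
    total
  }
}
{code}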

* We have two internal columnar formats: {{ColumnarBatch}} and {{CachedBatch}}. Will we integrate
these two into one?
** Yes - possibly keeping just {{ColumnarBatch}}.
*** Great to hear this. I like this idea. I have one terrible experience that I want to share.
When I added a field whose type is {{Array[DataType]}} to {{CachedBatch}}, it caused a performance
degradation. This is because {{SizeEstimator.estimate()}}, which takes a longer time for this
{{CachedBatch}}, is invoked by {{+=}} at [this statement|https://github.com/apache/spark/blob/4a6e78abd9d5edc4a5092738dff0006bbe202a89/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L214]
when {{values}} is an {{Iterator[CachedBatch]}}.
We may have to avoid this situation when we change {{CachedBatch}}; a minimal reproduction of the pitfall follows below.
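
To make the pitfall concrete, here is a minimal sketch. {{CachedBatchLike}} is an invented stand-in for {{CachedBatch}}; {{SizeEstimator}} is Spark's real utility, and the point is only that its cost grows with the number of references it has to traverse:

{code:scala}
import org.apache.spark.sql.types.{DataType, IntegerType}
import org.apache.spark.util.SizeEstimator

object SizeEstimatorPitfall {
  // Invented stand-in for CachedBatch, with the extra field added.
  case class CachedBatchLike(
      buffers: Array[Array[Byte]],
      fieldTypes: Array[DataType]) // the problematic Array[DataType] field

  def main(args: Array[String]): Unit = {
    val batch = CachedBatchLike(
      Array.fill(10)(new Array[Byte](1 << 16)),
      Array.fill[DataType](100)(IntegerType))

    // SizeEstimator.estimate() walks the whole object graph, so the
    // extra array of references makes each per-batch estimate (the
    // one MemoryStore triggers via `+=`) more expensive.
    println(SizeEstimator.estimate(batch))
  }
}
{code}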


> Columnar execution engine
> -------------------------
>
>                 Key: SPARK-15687
>                 URL: https://issues.apache.org/jira/browse/SPARK-15687
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Critical
>
> This ticket tracks progress in making the entire engine columnar, especially in the context
of nested data type support.
> In Spark 2.0, we have used the internal column batch interface in Parquet reading (via
a vectorized Parquet decoder) and low cardinality aggregation. Other parts of the engine are
already using whole-stage code generation, which is in many ways more efficient than a columnar
execution engine for flat data types.
> The goal here is to figure out a story to work towards making column batch the common
data exchange format between operators outside whole-stage code generation, as well as with
external systems (e.g. Pandas).
> Some of the important questions to answer are:
> From the architectural perspective: 
> - What is the end state architecture?
> - Should aggregation be columnar?
> - Should sorting be columnar?
> - How do we encode nested data? What are the operations on nested data, and how do we
handle these operations in a columnar format?
> - What is the transition plan towards the end state?
> From an external API perspective:
> - Can we expose a more efficient column batch user-defined function API?
> - How do we leverage this to integrate with 3rd party tools?
> - Can we have a spec for a fixed version of the column batch format that can be externalized
and use that in data source API v2?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

