spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
Date Fri, 24 Jun 2016 21:33:16 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348703#comment-15348703 ]

Apache Spark commented on SPARK-16196:
--------------------------------------

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/13899

> Optimize in-memory scan performance using ColumnarBatches
> ---------------------------------------------------------
>
>                 Key: SPARK-16196
>                 URL: https://issues.apache.org/jira/browse/SPARK-16196
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Andrew Or
>            Assignee: Andrew Or
>
> A simple benchmark such as the following reveals inefficiencies in the existing
> in-memory scan implementation:
> {code}
> val N = 1 << 20  // row count; the original does not define N, any large value works
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 10000) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()  // first collect populates the cache
> ds.collect()  // second collect reads from the in-memory scan
> {code}
> Caching is slow for several reasons. The biggest is that compression takes a long
> time. The second is that this hot code path contains many virtual function calls,
> since the rows are processed through iterators. Further, the rows are converted to
> and from ByteBuffers, which are generally slow to read.
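The virtual-call overhead described above can be sketched outside Spark as a plain Scala illustration. Names like `RowIterator` and `LongColumnBatch` below are hypothetical stand-ins, not Spark's actual internals: the point is that the row-at-a-time path pays two virtual calls per value, while the columnar path is a tight loop over a primitive array that the JIT can optimize.

```scala
// Row-at-a-time access: every value goes through virtual calls on an iterator
// interface, mirroring the iterator-based in-memory scan the issue describes.
trait RowIterator {
  def hasNext: Boolean
  def next(): Long
}

class RangeRowIterator(n: Int) extends RowIterator {
  private var i = 0
  def hasNext: Boolean = i < n
  def next(): Long = { val v = i.toLong; i += 1; v }
}

def sumRows(it: RowIterator): Long = {
  var s = 0L
  while (it.hasNext) s += it.next()  // two virtual dispatches per row
  s
}

// Columnar access: one batch object holding a primitive array, then a tight
// loop with no per-row virtual dispatch (hypothetical class, for illustration).
final class LongColumnBatch(val values: Array[Long])

def sumBatch(batch: LongColumnBatch): Long = {
  val a = batch.values
  var s = 0L
  var i = 0
  while (i < a.length) { s += a(i); i += 1 }
  s
}

val n = 1000
val expected = (n.toLong - 1) * n / 2  // sum of 0 until n
assert(sumRows(new RangeRowIterator(n)) == expected)
assert(sumBatch(new LongColumnBatch(Array.tabulate(n)(_.toLong))) == expected)
```

Both functions compute the same sum; the difference is purely in dispatch cost per element, which is what batching rows into columnar chunks is meant to amortize.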



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

