spark-issues mailing list archives

From "Marco Gaido (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22105) Dataframe has poor performance when computing on many columns with codegen
Date Sat, 10 Feb 2018 12:57:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16359420#comment-16359420 ]

Marco Gaido commented on SPARK-22105:
-------------------------------------

[~WeichenXu123] what is the number of rows in the dataset you tested? The time spent
generating and compiling the code can be a significant overhead when there is little data.


> Dataframe has poor performance when computing on many columns with codegen
> --------------------------------------------------------------------------
>
>                 Key: SPARK-22105
>                 URL: https://issues.apache.org/jira/browse/SPARK-22105
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, SQL
>    Affects Versions: 2.3.0
>            Reporter: Weichen Xu
>            Priority: Minor
>
> Suppose we have a DataFrame with many columns (e.g. 100), each of DoubleType, and we
> need to compute the average of each column. We find that using DataFrame avg is much
> slower than using RDD.aggregate.
> I observed this issue in this PR (one-pass Imputer):
> https://github.com/apache/spark/pull/18902
> I also wrote a minimal test case that reproduces the issue by computing sums:
> https://github.com/apache/spark/compare/master...WeichenXu123:aggr_test2?expand=1
> When we compute `sum` over 100 `DoubleType` columns, DataFrame aggregation is about 3x
> slower than `RDD.aggregate`, but when we compute only a single column, DataFrame
> aggregation is much faster than `RDD.aggregate`.
> The likely cause is a defect in DataFrame codegen: codegen inlines everything and
> generates one large code block. When the number of columns is large (e.g. 100), the
> generated method grows past what the JVM will JIT-compile (HotSpot by default refuses
> to compile methods larger than 8000 bytes of bytecode), so execution falls back to
> bytecode interpretation.
> This PR should address the issue:
> https://github.com/apache/spark/pull/19082
> But after the above PR is merged, we need more performance tests against ML code to
> check whether the issue is actually fixed.
> This JIRA tracks this performance issue.
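For reference, the one-pass aggregation pattern behind `RDD.aggregate` (a zero value, a
seqOp that folds each row into a per-partition accumulator, and a combOp that merges
partition results) can be sketched in plain Python. The data and partition split below are
hypothetical and Spark itself is not involved; this only illustrates the pattern:

```python
from functools import reduce

NUM_COLS = 100  # mirrors the 100 DoubleType columns in the benchmark

def seq_op(acc, row):
    # Fold one row into the per-partition accumulator of column sums.
    return [a + x for a, x in zip(acc, row)]

def comb_op(a, b):
    # Merge the accumulators of two partitions.
    return [x + y for x, y in zip(a, b)]

# Hypothetical data: 10 rows, every column in row i holds the value float(i).
rows = [[float(i) for _ in range(NUM_COLS)] for i in range(10)]
part1, part2 = rows[:5], rows[5:]  # simulate two partitions

zero = [0.0] * NUM_COLS
sums = comb_op(
    reduce(seq_op, part1, zero),
    reduce(seq_op, part2, zero),
)
# Each column sums 0+1+...+9, so every entry of `sums` is 45.0.
```

Because the whole row is folded in one pass, the per-row work grows linearly with the
column count and no per-column code needs to be generated or compiled.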



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

