spark-issues mailing list archives

From "Patrick Wendell (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-5863) Improve performance of convertToScala codepath.
Date Mon, 23 Mar 2015 00:04:11 GMT

     [ https://issues.apache.org/jira/browse/SPARK-5863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Wendell updated SPARK-5863:
-----------------------------------
    Target Version/s: 1.4.0  (was: 1.3.1, 1.4.0)

> Improve performance of convertToScala codepath.
> -----------------------------------------------
>
>                 Key: SPARK-5863
>                 URL: https://issues.apache.org/jira/browse/SPARK-5863
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.2.0, 1.2.1
>            Reporter: Cristian
>            Priority: Critical
>
> Was doing some perf testing on reading parquet files and noticed that moving from Spark 1.1 to 1.2 the performance is 3x worse. In the profiler the culprit showed up as being in ScalaReflection.convertRowToScala.
> In particular, this zip is the issue:
> {code}
> r.toSeq.zip(schema.fields.map(_.dataType))
> {code}
> I see there's currently a comment there noting that this is slow, but it wasn't fixed. This actually produces a 3x degradation in parquet read performance, at least in my test case.
> Edit: the map is part of the issue as well. This whole code block is in a tight loop and allocates a new ListBuffer that has to grow for each transformation. A possible solution is to use seq.view, which would allocate iterators instead.
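
For illustration, here is a minimal, self-contained Scala sketch of the allocation pattern described in the ticket and of two lighter-weight alternatives. The Row/DataType stand-ins, method names, and the loop-based variant are hypothetical, not the actual Spark 1.2 internals or the fix that was ultimately committed.

{code}
// Hypothetical stand-ins for a row and its column types -- not Spark classes.
sealed trait DataType
case object IntType    extends DataType
case object StringType extends DataType

object ConvertRowSketch {
  // Placeholder for the per-value conversion (convertToScala in Spark).
  def convertToScala(value: Any, dataType: DataType): Any = value

  // Eager version, same shape as the snippet above: each call materializes a
  // Seq of (value, type) tuples from zip and grows a builder in map. Done once
  // per row in a tight loop, these allocations add up.
  def convertEager(row: Seq[Any], types: Seq[DataType]): Seq[Any] =
    row.zip(types).map { case (v, t) => convertToScala(v, t) }

  // View-based version along the lines the ticket suggests: zip and map on a
  // view are lazy, so only the final materialization allocates a collection.
  def convertViaView(row: Seq[Any], types: Seq[DataType]): Seq[Any] =
    row.view.zip(types).map { case (v, t) => convertToScala(v, t) }.toIndexedSeq

  // Index-based loop into a preallocated array: no intermediate collections or
  // iterators at all, at the cost of less idiomatic code.
  def convertViaLoop(row: IndexedSeq[Any], types: IndexedSeq[DataType]): IndexedSeq[Any] = {
    val out = new Array[Any](row.length)
    var i = 0
    while (i < row.length) {
      out(i) = convertToScala(row(i), types(i))
      i += 1
    }
    out.toIndexedSeq
  }

  def main(args: Array[String]): Unit = {
    val types = IndexedSeq[DataType](IntType, StringType)
    val row   = IndexedSeq[Any](1, "a")
    // All three produce the same converted values; they differ only in how
    // much intermediate garbage they create per row.
    println(convertEager(row, types))
    println(convertViaView(row, types))
    println(convertViaLoop(row, types))
  }
}
{code}

Note also that schema.fields.map(_.dataType) in the original snippet rebuilds the same type sequence on every call; hoisting it out of the per-row loop is an independent saving regardless of which conversion variant is used.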





