systemml-issues mailing list archives

From "Deron Eriksson (JIRA)" <>
Subject [jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
Date Wed, 14 Sep 2016 18:20:22 GMT


Deron Eriksson commented on SYSTEMML-909:

[~mboehm7]
I would be happy to see that conversion code removed from the API and moved deeper into the
project. That would help make the API more lightweight.

> `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
> ------------------------------------------------------------
>                 Key: SYSTEMML-909
>                 URL:
>             Project: SystemML
>          Issue Type: Improvement
>            Reporter: Mike Dusenberry
> The {{determineDataFrameDimensionsIfNeeded(...)}} function in {{MLContext}} is a major
> bottleneck, particularly due to the `javaRDD` call.
> The issue I'm seeing is that the javaRDD.count() call forces execution of the lazy
> DataFrames I pass in. These are created from another DataFrame via
> df.randomSplit([0.8, 0.2]), so a shuffle occurs. I know this will happen anyway in the
> internal conversion, but doing it in this step as well wastes a lot of time. Assume that
> I have more data than I can efficiently cache (~7TB, with the potential for much more), so
> I need to incur the shuffle only once on the way into the engine.
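
The cost described in the issue can be illustrated with a toy model. The sketch below is plain Python, not Spark or SystemML code: `LazyDataset`, `convert`, `expensive_split`, and the `num_rows` parameter are all hypothetical names introduced for illustration. It assumes only what the issue states: that a count forces the lazy pipeline to run, and that the conversion has to run it again unless the dimensions are already known.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute  # callable that produces the rows on demand
        self.executions = 0      # number of times the full pipeline has run

    def _run(self):
        self.executions += 1
        return self._compute()

    def count(self):
        # An action: runs the whole pipeline just to learn the row count.
        return len(self._run())

    def collect(self):
        # The pass the conversion needs regardless of the count.
        return self._run()


def convert(dataset, num_rows=None):
    """Hypothetical conversion step; num_rows is optional caller-supplied metadata."""
    if num_rows is None:
        num_rows = dataset.count()  # extra pass over the data
    rows = dataset.collect()        # the pass that happens anyway
    return num_rows, rows


def expensive_split():
    # Stands in for an uncached, shuffle-inducing upstream computation.
    return [[float(i), float(i) * 2] for i in range(8)]


unknown = LazyDataset(expensive_split)
convert(unknown)                # dimensions not supplied

known = LazyDataset(expensive_split)
convert(known, num_rows=8)      # dimensions supplied up front

print(unknown.executions, known.executions)  # prints "2 1"
```

In this model, supplying the dimensions up front halves the number of passes over data that cannot be cached. Whether and how a real API accepts such metadata is outside the scope of this sketch.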

