spark-dev mailing list archives

From zsampson <zsamp...@palantir.com>
Subject DataFrame.withColumn very slow when used iteratively?
Date Tue, 02 Jun 2015 19:34:55 GMT
Hey,

I'm seeing extreme slowness in withColumn when it's used in a loop. I'm
running this code:

for (int i = 0; i < NUM_ITERATIONS; ++i) {
    df = df.withColumn("col" + i, new Column(new Literal(i, DataTypes.IntegerType)));
}

where df is initially a trivial dataframe. Here are the results of running
with different values of NUM_ITERATIONS:

iterations	time
25	3s
50	11s
75	31s
100	76s
125	159s
150	283s
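
To put a number on how fast these times grow, here is a rough sketch that estimates the local scaling exponent k (assuming time ~ n^k) between consecutive rows of the table above. The class and method names are made up for illustration, and it uses nothing but the measurements already listed; it says nothing about where the time actually goes inside Spark.

```java
public class ScalingExponent {
    // Estimates k in time ~ n^k from two (iterations, seconds) measurements.
    static double localExponent(double n1, double t1, double n2, double t2) {
        return Math.log(t2 / t1) / Math.log(n2 / n1);
    }

    public static void main(String[] args) {
        // Measurements copied from the table above.
        int[] iterations = {25, 50, 75, 100, 125, 150};
        double[] seconds = {3, 11, 31, 76, 159, 283};

        for (int i = 1; i < iterations.length; i++) {
            double k = localExponent(iterations[i - 1], seconds[i - 1],
                                     iterations[i], seconds[i]);
            System.out.printf("%d -> %d iterations: exponent ~ %.2f%n",
                              iterations[i - 1], iterations[i], k);
        }
    }
}
```

The exponent comes out at roughly 2 for the smallest runs and roughly 3 for the larger ones, so the growth looks clearly superlinear and nearer cubic than quadratic. With only six coarse timings, that is an indication, not a precise fit.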

When I update the DataFrame by manually copying/appending to the column
array and using DataFrame.select, it runs in about half the time, but this
is still untenable at any significant number of iterations.

Any insight?



