spark-dev mailing list archives

From zsampson
Subject DataFrame.withColumn very slow when used iteratively?
Date Tue, 02 Jun 2015 19:34:55 GMT

I'm seeing extreme slowness in withColumn when it's used in a loop. I'm
running this code:

for (int i = 0; i < NUM_ITERATIONS; ++i) {
    df = df.withColumn("col"+i, new Column(new Literal(i,

where df is initially a trivial dataframe. Here are the results of running
with different values of NUM_ITERATIONS:

iterations	time
25	3s
50	11s
75	31s
100	76s
125	159s
150	283s
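
To put a rough number on how quickly these timings grow, the sketch below fits a power law, time ≈ c * iterations^k, by least squares on the logarithms of the table above. The class and method names (GrowthExponent, exponent) are made up for this illustration and are not part of Spark; the only inputs are the six measurements listed. An exponent near 1 would indicate linear scaling; for these figures the fit should land well above that, between quadratic and cubic.

```java
// Hypothetical helper, not Spark code: estimates k in time ~ c * n^k
// from the measurements in the table above.
class GrowthExponent {

    // Least-squares slope of log(time) against log(iterations).
    static double exponent(double[] n, double[] t) {
        int m = n.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < m; i++) {
            double x = Math.log(n[i]);
            double y = Math.log(t[i]);
            sx += x;
            sy += y;
            sxx += x * x;
            sxy += x * y;
        }
        return (m * sxy - sx * sy) / (m * sxx - sx * sx);
    }

    public static void main(String[] args) {
        // Values copied from the iterations/time table.
        double[] iterations = {25, 50, 75, 100, 125, 150};
        double[] seconds = {3, 11, 31, 76, 159, 283};
        System.out.printf("estimated exponent: %.2f%n",
                exponent(iterations, seconds));
    }
}
```

This only summarizes the measurements; it says nothing about where inside withColumn the extra time is spent.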

When I update the DataFrame by manually copying/appending to the column
array and using, it runs in about half the time, but this
is still untenable at any significant number of iterations.

Any insight?
