spark-dev mailing list archives

From alexandre Clement <a.p.clem...@gmail.com>
Subject Re: withColumn is very slow with datasets with large number of columns
Date Thu, 30 Apr 2015 14:37:56 GMT
I have reported the issue on JIRA:
https://issues.apache.org/jira/browse/SPARK-7276

On Thu, Apr 30, 2015 at 4:36 PM, alexandre Clement <a.p.clement@gmail.com>
wrote:

> Hi all,
>
>
> I'm experiencing a serious performance problem when using withColumn on
> datasets with a large number of columns. It is very slow: on a dataset with
> 100 columns, each call takes a few seconds.
>
>
> The code snippet demonstrates the problem.
>
>
> // Assumes: import org.apache.spark.sql.Row
> //          import org.apache.spark.sql.types._
> val custs = Seq(
>   Row(1, "Bob", 21, 80.5),
>   Row(2, "Bobby", 21, 80.5),
>   Row(3, "Jean", 21, 80.5),
>   Row(4, "Fatime", 21, 80.5)
> )
>
> // Field order must match the Row values: (Int, String, Int, Double)
> val fields = List(
>   StructField("id", IntegerType, true),
>   StructField("b", StringType, true),
>   StructField("a", IntegerType, true),
>   StructField("target", DoubleType, false))
> val schema = StructType(fields)
>
> val rdd = sc.parallelize(custs)
> var df = sqlContext.createDataFrame(rdd, schema)
>
> for (i <- 1 to 200) {
>   val now = System.currentTimeMillis
>   df = df.withColumn("a_new_col_" + i, df("a") + i)
>   println(s"$i -> " + (System.currentTimeMillis - now))
> }
>
> df.show()
>
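[Editor's note: a sketch of a common workaround, not from the original thread. Each chained withColumn call re-analyzes a progressively larger plan, which is what the per-iteration timings above measure; building all derived columns in a single select avoids the repeated analysis. Assumes the same df, sqlContext, and schema as the quoted snippet.]

```scala
import org.apache.spark.sql.functions.col

// Build all 200 derived columns up front as Column expressions...
val newCols = (1 to 200).map(i => (col("a") + i).as("a_new_col_" + i))

// ...and add them in one select, producing a single analyzed plan
// instead of 200 incrementally larger ones.
val df2 = df.select(col("*") +: newCols: _*)
df2.show()
```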
