spark-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR
Date Tue, 03 Nov 2015 12:48:21 GMT
I am a bit curious: why is the synchronization on finalLock needed?

Thanks

> On Oct 23, 2015, at 8:25 AM, Anubhav Agarwal <anubhav33@gmail.com> wrote:
> 
> I have a Spark job that creates 6 million rows in RDDs. I convert each RDD into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS.
> 
> Here is the snippet:
> 
>     RDDList.parallelStream().forEach(mapJavaRDD -> {
>         if (mapJavaRDD != null) {
>             JavaRDD<Row> rowRDD = mapJavaRDD.mapPartitionsWithIndex((integer, v2) -> {
>                 // <logical operation>
>                 return new ArrayList<Row>(1).iterator();
>             }, false);
> 
>             DataFrame dF = sqlContext.createDataFrame(rowRDD, schema).coalesce(3);
>             synchronized (finalLock) {
>                 dF.write().mode(SaveMode.Append).parquet("hdfs location");
>             }
>         }
>     });
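
If the lock is only there to serialize the writes, one alternative worth trying, a rough sketch reusing the names from your snippet and assuming every rowRDD has the same schema, is to combine the DataFrames with unionAll on the driver and issue a single write. With one write there is only one SQL execution at a time, so neither finalLock nor parallelStream() is needed:

    // Sketch, untested against this job. Assumes the same imports as the
    // original snippet (JavaRDD, Row, DataFrame, SaveMode, java.util.ArrayList);
    // "hdfs location" is the same placeholder path as above.
    DataFrame combined = null;
    for (JavaRDD mapJavaRDD : RDDList) {
        if (mapJavaRDD == null) continue;
        JavaRDD<Row> rowRDD = mapJavaRDD.mapPartitionsWithIndex((integer, v2) -> {
            // <logical operation>
            return new ArrayList<Row>(1).iterator();
        }, false);
        DataFrame dF = sqlContext.createDataFrame(rowRDD, schema);
        combined = (combined == null) ? dF : combined.unionAll(dF);
    }
    if (combined != null) {
        // one job, one SQL execution, one set of output files
        combined.coalesce(3).write().mode(SaveMode.Append).parquet("hdfs location");
    }
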
> 
> After looking into the logs I know the following is the reason for the job taking too long:
> 
>     dF.write().mode(SaveMode.Append).parquet("hdfs location");
> 
> I also get the following errors due to it:
> 
> 15/10/21 21:12:30 WARN scheduler.TaskSetManager: Stage 31 contains a task of very large size (378 KB). The maximum recommended task size is 100 KB.
> 
> Four warnings of this kind appeared.
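
On the task size warning: it usually means each serialized task ships a lot of data, often a large object captured by the task's closure. If your <logical operation> captures something sizeable from the driver, broadcasting it is the standard fix. A sketch with a hypothetical lookupTable map (not from your code):

    // Hypothetical example: lookupTable is a large driver-side Map captured
    // by the closure. Broadcast it so each executor fetches it once instead
    // of it being serialized into every task.
    // Needs org.apache.spark.broadcast.Broadcast and a JavaSparkContext (jsc).
    Broadcast<Map<String, String>> lookupBc = jsc.broadcast(lookupTable);

    JavaRDD<Row> rowRDD = mapJavaRDD.mapPartitionsWithIndex((integer, v2) -> {
        Map<String, String> lookup = lookupBc.value(); // local fetch, not shipped per task
        // <logical operation> using lookup
        return new ArrayList<Row>(1).iterator();
    }, false);
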
> 
> java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: spark.sql.execution.id is already set
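
That exception looks like SPARK-10548: the SQL execution id is kept in an inheritable thread-local, so jobs launched from the shared fork-join pool behind parallelStream() can inherit a stale id from another write. Upgrading to a release with that fix, or doing a single sequential write as sketched above, is the real cure; a workaround some people have used in the meantime (untested here) is to clear the property before each write:

    // Sketch: setLocalProperty(key, null) removes the inherited property, so
    // the write can create a fresh execution id in this thread.
    sqlContext.sparkContext().setLocalProperty("spark.sql.execution.id", null);
    dF.write().mode(SaveMode.Append).parquet("hdfs location");
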
