spark-user mailing list archives

From: morfious902002 <>
Subject: Improve parquet write speed to HDFS and is already set ERROR
Date: Fri, 23 Oct 2015 15:57:42 GMT
I have a Spark job that creates 6 million rows in RDDs. I convert the RDDs
into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write
it to HDFS.
I am using Spark 1.5.1 with YARN.

Here is the snippet:

RDDList.parallelStream().forEach(mapJavaRDD -> {
    if (mapJavaRDD != null) {
        JavaRDD<Row> rowRDD = mapJavaRDD.mapPartitionsWithIndex((integer, v2) -> {
            // <logical operation>
            return new ArrayList<Row>(1).iterator();
        }, false);

        // 'schema' is a stand-in; the second argument was cut off in the original message.
        DataFrame dF = sqlContext.createDataFrame(rowRDD, schema);
        synchronized (finalLock) {
            // The body of the synchronized block was cut off in the original message;
            // 'outputPath' is a stand-in for the Parquet write the subject describes.
            dF.write().parquet(outputPath);
        }
    }
});

After looking into the logs, I believe the following is the reason the job
takes so long:

15/10/21 21:12:30 WARN scheduler.TaskSetManager: Stage 31 contains a task of
very large size (378 KB). The maximum recommended task size is 100 KB.

Four warnings of this kind appeared.

I also get the following error because of it:

java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: is already set
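For context on the warning: TaskSetManager reports a large task size when the serialized closure shipped with each task is big, typically because a lambda captures a large driver-side object. If the <logical operation> step captures such an object, a broadcast variable ships it once per executor instead of once per task. A minimal sketch, where sc (the JavaSparkContext) and lookupTable are hypothetical stand-ins:

import java.util.ArrayList;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Row;

// lookupTable: stand-in for whatever large object the row-building logic uses.
Broadcast<Map<String, String>> lookup = sc.broadcast(lookupTable);

JavaRDD<Row> rowRDD = mapJavaRDD.mapPartitionsWithIndex((integer, it) -> {
    // Only a small broadcast handle is serialized with the task; the data
    // itself is fetched once per executor and cached, which keeps the
    // serialized task well under the 100 KB guideline.
    Map<String, String> table = lookup.value();
    // ... build rows using table instead of the captured driver object ...
    return new ArrayList<Row>(1).iterator();
}, false);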
