spark-issues mailing list archives

From "Anbu Cheeralan (JIRA)" <>
Subject [jira] [Commented] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode
Date Thu, 15 Dec 2016 20:34:59 GMT


Anbu Cheeralan commented on SPARK-17493:

[~sowen] I hit a similar error while writing to Google Storage. The issue is specific to
writing to object stores, and it happens in append mode.

In org.apache.spark.sql.execution.datasources.DataSource.write(), the following code causes a huge
number of RPC calls when the file system is an object store (S3, GS):
          if (mode == SaveMode.Append) {
            val existingPartitionColumns = Try {
There should be a flag to skip the partition match check in append mode. I can work on the patch.
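
To illustrate why such a flag would help, here is a minimal sketch in plain Scala. It is a toy model and not Spark's actual code: CountingStore, discoverPartitionColumns, validateAppend and the skipPartitionCheck parameter are all hypothetical names invented for this example. It assumes that discovering the existing partition columns requires listing the output directory and each partition directory beneath it, and that each listing is one remote call.

```scala
import scala.util.Try

// Toy stand-in for an object store: every directory listing counts as one remote call.
final class CountingStore(layout: Map[String, Seq[String]]) {
  var listCalls: Int = 0
  def list(path: String): Seq[String] = {
    listCalls += 1
    layout.getOrElse(path, Seq.empty)
  }
}

object AppendCheckSketch {
  // Walk the output directory and collect partition column names from
  // directory names of the form "column=value".
  def discoverPartitionColumns(store: CountingStore, root: String): Seq[String] = {
    def walk(path: String): Seq[String] =
      store.list(path).flatMap { child =>
        child.split("=", 2) match {
          case Array(col, _) => col +: walk(s"$path/$child")
          case _             => Seq.empty // a data file, not a partition directory
        }
      }
    walk(root).distinct
  }

  // skipPartitionCheck is a hypothetical flag, not an existing Spark option.
  def validateAppend(
      store: CountingStore,
      root: String,
      requested: Seq[String],
      skipPartitionCheck: Boolean): Unit =
    if (!skipPartitionCheck) {
      val existing = Try(discoverPartitionColumns(store, root)).getOrElse(Seq.empty)
      require(
        existing.isEmpty || existing == requested,
        s"Requested partitioning $requested does not match existing $existing")
    }

  // Build an output directory with `partitions` partition directories,
  // run the append validation, and report how many listings it cost.
  def listCallsFor(partitions: Int, skipPartitionCheck: Boolean): Int = {
    val root = "out"
    val dirs = (1 to partitions).map(i => s"event_date=d$i")
    val layout: Map[String, Seq[String]] =
      Map(root -> dirs) ++ dirs.map(d => s"$root/$d" -> Seq("part-00000.parquet"))
    val store = new CountingStore(layout)
    validateAppend(store, root, Seq("event_date"), skipPartitionCheck)
    store.listCalls
  }

  def main(args: Array[String]): Unit = {
    // With the check: one listing for the root plus one per partition directory.
    println(s"check enabled:  ${listCallsFor(100, skipPartitionCheck = false)} list calls")
    // With the hypothetical flag set, no listing happens at all.
    println(s"check skipped:  ${listCallsFor(100, skipPartitionCheck = true)} list calls")
  }
}
```

In this model the number of listings grows with the number of existing partitions, which is cheap on HDFS but slow on an object store where each listing is a separate request. Skipping the check trades that cost for the risk of appending data with a mismatched partition layout.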

> Spark Job hangs while DataFrame writing to HDFS path with parquet mode
> ----------------------------------------------------------------------
>                 Key: SPARK-17493
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>         Environment: AWS Cluster
>            Reporter: Gautam Solanki
> While saving an RDD to an HDFS path in parquet format with the following call:
> rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:////tmp//rddout_parquet_full_hdfs1//")
> the Spark job hung because two write tasks with a Shuffle Read size of 0 could not complete.
> However, the executors had notified the driver that these two tasks were complete.
