spark-dev mailing list archives

From sririshindra <sririshin...@gmail.com>
Subject Re: Output Committers for S3
Date Sat, 17 Jun 2017 04:35:00 GMT
Hi Ryan and Steve,

Thanks very much for your reply.

I was finally able to get Ryan's repo working for me by changing the output
committer in Spark to FileOutputCommitter instead of ParquetOutputCommitter,
as Steve suggested.
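For anyone following along, the committer swap is done through configuration rather than code. A minimal spark-defaults.conf sketch is below; the committer class name (com.netflix.bdp.s3.S3PartitionedOutputCommitter) is my assumption based on the package layout of Ryan's s3committer repo, so verify it against your build:

```
# Sketch only: committer class name assumed from Ryan's s3committer repo.
# Committer used for parquet output:
spark.sql.parquet.output.committer.class    com.netflix.bdp.s3.S3PartitionedOutputCommitter
# Committer used for other file-based data sources:
spark.sql.sources.outputCommitterClass      com.netflix.bdp.s3.S3PartitionedOutputCommitter
```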

However, it is not working in append mode when saving the data frame.

    // Read a local parquet file, cache it to disk, then append it to S3
    val hf = spark.read.parquet(
      "/home/user/softwares/spark-2.1.0-bin-hadoop2.7/examples/src/main/resources/users.parquet")

    hf.persist(StorageLevel.DISK_ONLY)
    hf.show()

    hf.write
      .partitionBy("name")
      .mode("append")
      .save(S3Location + "data" + ".parquet")



The above code successfully saves the parquet file when I run it for the
first time. But when I rerun the code, the new parquet files are not
added to S3.

I put a print statement in the constructors of PartitionedOutputCommitter
in Ryan's repo and realized that the partitioned output committer is not
even being called the second time I run the code; it is called only the
first time. Is there anything I can do to make Spark call the
PartitionedOutputCommitter even when the file already exists in S3?
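In case it helps, one workaround I am considering (just a sketch, assuming the custom committer is bypassed only on the append path) is to side-step append mode by writing each run's data directly under its concrete partition directory. The partition value "alice" here is purely illustrative:

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: write one partition's rows straight into its
// name=<value> directory so each run is a fresh (non-append) write and
// the configured committer should be invoked. "alice" is an assumed value.
hf.filter(col("name") === "alice")
  .drop("name") // the directory name already carries the partition value
  .write
  .parquet(S3Location + "data.parquet/name=alice")
```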

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Output-Committers-for-S3-tp21033p21776.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org

