spark-user mailing list archives

From Sumit Khanna <sumit.kha...@askme.in>
Subject Re: how to save spark files as parquets efficiently
Date Fri, 29 Jul 2016 10:07:46 GMT
Hey,

So I believe this is the right way to save the file, i.e. the write command
itself is fine, and any optimization needed lies in the head/body of my
execution plan upstream of the write. Isn't it?
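To make that concrete with an analogy outside Spark: with lazy evaluation,
building the pipeline costs nothing and all the upstream work lands on the
terminal action, so a slow "save" can really be a slow plan. A minimal
plain-Python sketch of that effect (just an illustration, not Spark itself):

```python
import time

def slow_transform(rows):
    # Lazy generator: no work happens until something iterates over it.
    for r in rows:
        time.sleep(0.001)  # stand-in for an expensive per-row transformation
        yield r * 2

pipeline = slow_transform(range(100))  # returns immediately, nothing computed

start = time.time()
result = list(pipeline)  # the "action": the whole pipeline runs here
elapsed = time.time() - start  # all the per-row cost is billed to this call
```

In the same way, the 1.8 hrs attributed to the parquet write may mostly be
the upstream plan executing, not the write itself.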

Thanks,

On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna <sumit.khanna@askme.in>
wrote:

> Hey,
>
> master=yarn
> mode=cluster
>
> spark.executor.memory=8g
> spark.rpc.netty.dispatcher.numThreads=2
>
> All the POC runs are on a single-node cluster. The biggest bottleneck:
>
> 1.8 hrs to save 500k records as a parquet file/dir, executing this command:
>
> df.write.format("parquet").mode("overwrite").save(hdfspathTemp)
>
>
> No doubt, the whole execution plan gets triggered on this write / save
> action. But is it the right command / set of params to save a dataframe?
>
> Essentially I am doing an upsert: pulling in the base data from HDFS and
> updating it with the delta changes of the current run. But I am not sure
> whether the write itself takes that much time or whether the upsert needs
> optimization. (I have asked that as another question altogether.)
>
> Thanks,
> Sumit
>
>
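For reference, the upsert described above (base data from HDFS updated with
the current run's delta, keyed on some id) has the semantics of a keyed merge
where the delta wins. A minimal sketch of just that logic in plain Python
(hypothetical field names, not the actual Spark job):

```python
def upsert(base, delta, key):
    # Index base rows by key, then let delta rows overwrite or extend them.
    merged = {row[key]: row for row in base}
    merged.update({row[key]: row for row in delta})
    return list(merged.values())

base = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
delta = [{"id": 2, "val": "b2"}, {"id": 3, "val": "c"}]  # id 2 updated, id 3 new
result = upsert(base, delta, "id")
```

In Spark terms this would typically be a join (or union plus dedup preferring
the delta) on the key column before the final parquet write.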
