spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jerry Lam <chiling...@gmail.com>
Subject Re: DataFrameWriter on partitionBy for parquet eat all RAM
Date Sat, 16 Jan 2016 00:43:25 GMT
Hi Michael,

Thanks for sharing the tip. It will help to the write path of the partitioned table. 
Do you have similar suggestion on reading the partitioned table back when there is a million
of distinct values on the partition field (for example on user id)? Last time I have trouble
to read a partitioned table because it takes very long (over hours on s3) to execute the sqlcontext.read.parquet("partitioned_table").

Best Regards,

Jerry

Sent from my iPhone

> On 15 Jan, 2016, at 3:59 pm, Michael Armbrust <michael@databricks.com> wrote:
> 
> See here for some workarounds: https://issues.apache.org/jira/browse/SPARK-12546
> 
>> On Thu, Jan 14, 2016 at 6:46 PM, Jerry Lam <chilinglam@gmail.com> wrote:
>> Hi Arkadiusz,
>> 
>> the partitionBy is not designed to have many distinct value the last time I used
it. If you search in the mailing list, I think there are couple of people also face similar
issues. For example, in my case, it won't work over a million distinct user ids. It will require
a lot of memory and very long time to read the table back. 
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>>> On Thu, Jan 14, 2016 at 2:31 PM, Arkadiusz Bicz <arkadiusz.bicz@gmail.com>
wrote:
>>> Hi
>>> 
>>> What is the proper configuration for saving parquet partition with
>>> large number of repeated keys?
>>> 
>>> On bellow code I load 500 milion rows of data and partition it on
>>> column with not so many different values.
>>> 
>>> Using spark-shell with 30g per executor and driver and 3 executor cores
>>> 
>>> sqlContext.read.load("hdfs://notpartitioneddata").write.partitionBy("columnname").parquet("partitioneddata")
>>> 
>>> 
>>> Job failed because not enough memory in executor :
>>> 
>>> WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by
>>> YARN for exceeding memory limits. 43.5 GB of 43.5 GB physical memory
>>> used. Consider boosting spark.yarn.executor.memoryOverhead.
>>> 16/01/14 17:32:38 ERROR YarnScheduler: Lost executor 11 on
>>> datanode2.babar.poc: Container killed by YARN for exceeding memory
>>> limits. 43.5 GB of 43.5 GB physical memory used. Consider boosting
>>> spark.yarn.executor.memoryOverhead.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
> 

Mime
View raw message