spark-user mailing list archives

From "Anantharaman, Srinatha (Contractor)" <Srinatha_Ananthara...@comcast.com>
Subject RE: Spark consumes more memory
Date Thu, 11 May 2017 21:05:06 GMT
Rick,

Thank you for the input. The space issue is now resolved: yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs
were filling up.

For 5 GB of data, why should it take 10 minutes to load with 7-8 executors of 2 cores each? I also
see each executor's memory usage ranging from 7 to 20 GB.
If 5 GB of data takes this many resources, what will happen when I load 50 GB of data?

I tried reducing the partitions to 64, but it still takes more than 10 minutes.

Is there any configuration that would help me improve the loading process and consume less memory?

Regards,
~Sri

From: Rick Moritz [mailto:rahvin@gmail.com]
Sent: Thursday, May 11, 2017 1:34 PM
To: Anantharaman, Srinatha (Contractor) <Srinatha_Anantharaman@comcast.com>; user <user@spark.apache.org>
Subject: Re: Spark consumes more memory

I would try to track down the "no space left on device" error and find out where it originates,
since you should be able to allocate 10 executors with 4 cores and 15 GB RAM each quite easily.
In that case, you may want to increase the memory overhead so YARN doesn't kill your executors.
Check that no local drives are filling up with temporary data by running a watch on df on all
nodes.
Also check that no quotas are being enforced, and that your log partitions aren't overflowing.
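A rough back-of-envelope for the container sizing (a sketch only; the figures are taken from the spark-shell invocation quoted later in this thread, and the variable names are mine): YARN allocates executor memory plus the memory overhead per container, which bounds how many executors fit on a node.

```scala
// Back-of-envelope YARN container sizing.
// Assumption: figures come from the spark-shell command quoted below in this thread.
object ContainerSizing {
  def main(args: Array[String]): Unit = {
    val executorMemGB = 7.0             // --executor-memory 7G
    val overheadGB = 512.0 / 1024       // spark.yarn.executor.memoryOverhead=512 (MB)
    val containerGB = executorMemGB + overheadGB  // what YARN reserves per executor
    val nodeMemGB = 35.0                // RAM per node in the 10-node cluster
    val executorsPerNode = (nodeMemGB / containerGB).toInt
    println(s"container = $containerGB GB, executors per node = $executorsPerNode")
  }
}
```

With these numbers each container reserves 7.5 GB, so at most 4 executors fit on a 35 GB node; raising the overhead shrinks that headroom further, which is the trade-off to watch when YARN starts killing executors.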

Depending on your disk and network speed, as well as the time it takes YARN to allocate resources
and Spark to initialize the Spark context, 10 minutes doesn't sound too bad. Also, I don't
think 150 partitions is a helpful partition count if you have 7 GB RAM per executor and aren't
doing any joins or other memory-intensive calculations. Try again with 64 partitions and
see if the reduced overhead helps.
Also, track which actions/tasks are running longer than expected in the Spark UI. That should help
identify where your bottleneck is located.
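A common rule of thumb for sizing partitions is to aim for roughly one HDFS block (~128 MB) per partition; that figure is an assumption on my part, not something from this thread, but it gives a quick sanity check on the 150-partition choice:

```scala
// Rule-of-thumb partition count for a 5 GB table.
// Assumption: ~128 MB per partition (typical HDFS block size); not a figure from this thread.
object PartitionCount {
  def main(args: Array[String]): Unit = {
    val dataSizeMB = 5 * 1024           // 5 GB table
    val targetPartitionMB = 128
    val numPartitions =
      math.max(1, math.ceil(dataSizeMB.toDouble / targetPartitionMB).toInt)
    println(s"suggested partitions: $numPartitions")  // 5120 / 128 = 40
  }
}
```

By this estimate 40 partitions would suffice for 5 GB, so both 150 and 64 are on the high side; many small partitions mostly add scheduling and serialization overhead here.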

On Thu, May 11, 2017 at 5:46 PM, Anantharaman, Srinatha (Contractor) <Srinatha_Anantharaman@comcast.com>
wrote:
Hi,

I am reading a Hive ORC table into memory; the StorageLevel is set to StorageLevel.MEMORY_AND_DISK_SER.
The total size of the Hive table is 5 GB.
I started spark-shell as below:

spark-shell --master yarn --deploy-mode client --num-executors 8 --driver-memory 5G \
  --executor-memory 7G --executor-cores 2 --conf spark.yarn.executor.memoryOverhead=512
I have a 10-node cluster, each node with 35 GB memory and 4 cores, running HDP 2.5.
The SPARK_LOCAL_DIRS location has enough space.

My concern is that the simple code below takes approx. 10-12 minutes to load the data into memory.
If I change the values for num-executors/driver-memory/executor-memory/executor-cores from those
mentioned above, I get a "No space left on device" error.
While running, each node consumes a varying amount of memory, from 7 GB to 20 GB.

import org.apache.spark.storage.StorageLevel

// Hive-aware SQLContext; sc is the SparkContext provided by spark-shell
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
// Read the table, repartition, and cache it serialized in memory, spilling to disk if needed
val tab1 = sqlContext.sql("select * from xyz").repartition(150).persist(StorageLevel.MEMORY_AND_DISK_SER)
tab1.registerTempTable("AUDIT")
tab1.count()  // action that forces the read and materializes the cache

Kindly advise how to improve the performance of loading the Hive table into Spark memory and avoid
the space issue.

Regards,
~Sri
