spark-issues mailing list archives

From "Nishkam Ravi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
Date Fri, 11 Jul 2014 23:31:06 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14059489#comment-14059489 ]

Nishkam Ravi commented on SPARK-2398:
-------------------------------------

[~gq] [~sowen] I don't specify -Xmx; I'm only requesting a YARN container of a certain size.
If I specified both, I'd be talking to the JVM and YARN at the same time and potentially sending
inconsistent messages. If I only specify the container size, YARN should take care of this without
bothering the developer (i.e., allocate specified_container_size + memory_overhead, where
memory_overhead = f(specified_container_size)). Ideally.

I double-checked and made sure that all config parameters are identical between the 0.9 and 1.0
deployments. I suspect something has changed for the worse. I can do further diagnosis
by redeploying 0.9 and looking at the NodeManager logs. Increasing spark.yarn.executor.memoryOverhead
fixes this problem.
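As a rough illustration of the f(specified_container_size) idea above, here is a sketch of how a YARN container request could be sized; the 384 MB floor and the 0.07 fraction are assumptions modeled on how later Spark releases compute spark.yarn.executor.memoryOverhead, not the exact Spark 1.0 behavior.

```python
# Sketch of YARN container sizing for a Spark executor.
# ASSUMPTION: overhead = max(MIN_OVERHEAD_MB, OVERHEAD_FRACTION * heap);
# Spark 1.0 actually used a flat default overhead, so treat this as illustrative.
MIN_OVERHEAD_MB = 384     # assumed floor, in MB
OVERHEAD_FRACTION = 0.07  # assumed fraction of the executor heap

def container_request_mb(executor_memory_mb: int) -> int:
    """Total memory to request from YARN: executor heap plus off-heap overhead."""
    overhead = max(MIN_OVERHEAD_MB, int(OVERHEAD_FRACTION * executor_memory_mb))
    return executor_memory_mb + overhead

# With spark.executor.memory=32g (32768 MB), the fractional term dominates the floor:
print(container_request_mb(32768))  # → 35061 (32768 + 2293)
```

With a sizing rule like this, a user who only sets the container (executor) size never has to reason about -Xmx and the YARN limit separately.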

> Trouble running Spark 1.0 on Yarn 
> ----------------------------------
>
>                 Key: SPARK-2398
>                 URL: https://issues.apache.org/jira/browse/SPARK-2398
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0.
> For example: SparkPageRank, when run in standalone mode, goes through without any errors
> (tested for up to a 30GB input dataset on a 6-node cluster). It also runs fine for a 1GB
> dataset in yarn cluster mode, but starts to choke (in yarn cluster mode) as the input data
> size is increased. Confirmed for a 16GB input dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for
> up to a 30GB input dataset on a 6-node cluster).
> Commandline used:
> (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster
> --properties-file pagerank.conf --driver-memory 30g --driver-cores 16 --num-executors 5
> --class org.apache.spark.examples.SparkPageRank
> /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
> pagerank_in $NUM_ITER)
> pagerank.conf:
> spark.master            spark://c1704.halxg.cloudera.com:7077
> spark.home      /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory   32g
> spark.default.parallelism       118
> spark.cores.max 96
> spark.storage.memoryFraction    0.6
> spark.shuffle.memoryFraction    0.3
> spark.shuffle.compress  true
> spark.shuffle.spill.compress    true
> spark.broadcast.compress        true
> spark.rdd.compress      false
> spark.io.compression.codec      org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size  32768
> spark.reducer.maxMbInFlight     48
> spark.local.dir  /var/lib/jenkins/workspace/tmp
> spark.driver.memory     30g
> spark.executor.cores    16
> spark.locality.wait     6000
> spark.executor.instances        5
> UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> --------
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
>         at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>         at java.io.FilterInputStream.close(FilterInputStream.java:181)
>         at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
>         at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
>         at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> -------
> 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
>         at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
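Since increasing spark.yarn.executor.memoryOverhead fixes the problem, one way to apply the workaround in this setup would be through the same pagerank.conf passed via --properties-file; the 2048 MB value below is purely illustrative, not a setting verified against this cluster.

```
spark.yarn.executor.memoryOverhead      2048
```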



--
This message was sent by Atlassian JIRA
(v6.2#6252)
