spark-issues mailing list archives

From "Guoqiang Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2398) Trouble running Spark 1.0 on Yarn
Date Tue, 08 Jul 2014 02:24:34 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054434#comment-14054434 ]

Guoqiang Li commented on SPARK-2398:
------------------------------------

Seems to be related to [SPARK-1930|https://issues.apache.org/jira/browse/SPARK-1930].
Can you post the yarn node manager log?

> Trouble running Spark 1.0 on Yarn 
> ----------------------------------
>
>                 Key: SPARK-2398
>                 URL: https://issues.apache.org/jira/browse/SPARK-2398
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.0.0
>            Reporter: Nishkam Ravi
>
> Trouble running workloads in Spark-on-YARN cluster mode for Spark 1.0.
> For example: SparkPageRank when run in standalone mode goes through without any errors
> (tested for up to 30GB input dataset on a 6-node cluster).  Also runs fine for a 1GB dataset
> in yarn cluster mode. Starts to choke (in yarn cluster mode) as the input data size is increased.
> Confirmed for 16GB input dataset.
> The same workload runs fine with Spark 0.9 in both standalone and yarn cluster mode (for
> up to 30 GB input dataset on a 6-node cluster).
> Commandline used:
> (/opt/cloudera/parcels/CDH/lib/spark/bin/spark-submit --master yarn --deploy-mode cluster
> --properties-file pagerank.conf  --driver-memory 30g --driver-cores 16 --num-executors 5 --class
> org.apache.spark.examples.SparkPageRank /opt/cloudera/parcels/CDH/lib/spark/examples/lib/spark-examples_2.10-1.0.0-cdh5.1.0-SNAPSHOT.jar
> pagerank_in $NUM_ITER)
> pagerank.conf:
> spark.master            spark://c1704.halxg.cloudera.com:7077
> spark.home      /opt/cloudera/parcels/CDH/lib/spark
> spark.executor.memory   32g
> spark.default.parallelism       118
> spark.cores.max 96
> spark.storage.memoryFraction    0.6
> spark.shuffle.memoryFraction    0.3
> spark.shuffle.compress  true
> spark.shuffle.spill.compress    true
> spark.broadcast.compress        true
> spark.rdd.compress      false
> spark.io.compression.codec      org.apache.spark.io.LZFCompressionCodec
> spark.io.compression.snappy.block.size  32768
> spark.reducer.maxMbInFlight     48
> spark.local.dir  /var/lib/jenkins/workspace/tmp
> spark.driver.memory     30g
> spark.executor.cores    16
> spark.locality.wait     6000
> spark.executor.instances        5
> UI shows ExecutorLostFailure. Yarn logs contain numerous exceptions:
> 14/07/07 17:59:49 WARN network.SendingConnection: Error writing in connection to ConnectionManagerId(a1016.halxg.cloudera.com,54105)
> java.nio.channels.AsynchronousCloseException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:205)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.spark.network.SendingConnection.write(Connection.scala:361)
>         at org.apache.spark.network.ConnectionManager$$anon$5.run(ConnectionManager.scala:142)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> --------
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
>         at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:619)
>         at java.io.FilterInputStream.close(FilterInputStream.java:181)
>         at org.apache.hadoop.util.LineReader.close(LineReader.java:150)
>         at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:244)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1.close(HadoopRDD.scala:226)
>         at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
>         at org.apache.spark.rdd.HadoopRDD$$anon$1$$anonfun$1.apply$mcV$sp(HadoopRDD.scala:197)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at org.apache.spark.TaskContext$$anonfun$executeOnCompleteCallbacks$1.apply(TaskContext.scala:63)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>         at org.apache.spark.TaskContext.executeOnCompleteCallbacks(TaskContext.scala:63)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:156)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:97)
>         at org.apache.spark.scheduler.Task.run(Task.scala:51)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> -------
> 14/07/07 17:59:52 WARN network.SendingConnection: Error finishing connection to a1016.halxg.cloudera.com/10.20.184.116:54105
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at org.apache.spark.network.SendingConnection.finishConnect(Connection.scala:313)
>         at org.apache.spark.network.ConnectionManager$$anon$7.run(ConnectionManager.scala:203)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
