spark-issues mailing list archives

From "Baogang Wang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming
Date Fri, 21 Aug 2015 03:27:46 GMT

    [ https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14706139#comment-14706139 ]

Baogang Wang edited comment on SPARK-10145 at 8/21/15 3:27 AM:
---------------------------------------------------------------

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.akka.frameSize                    1024
spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                      900
spark.storage.memoryFraction            0.4
spark.rdd.compress                      true
spark.shuffle.blockTransferService      nio
spark.yarn.executor.memoryOverhead      1024
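
For reference, a minimal sketch of the same settings applied programmatically through SparkConf (SparkConf.set is the standard Spark API for this; the property names and values are exactly the ones listed above, not recommendations):

    import org.apache.spark.SparkConf

    // Programmatic equivalent of the spark-defaults.conf entries above,
    // applied before the SparkContext is created.
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.akka.frameSize", "1024")
      .set("spark.driver.extraJavaOptions", "-Dhdp.version=2.2.0.0-2041")
      .set("spark.yarn.am.extraJavaOptions", "-Dhdp.version=2.2.0.0-2041")
      .set("spark.akka.timeout", "900")
      .set("spark.storage.memoryFraction", "0.4")
      .set("spark.rdd.compress", "true")
      .set("spark.shuffle.blockTransferService", "nio")
      .set("spark.yarn.executor.memoryOverhead", "1024")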


was (Author: heayin):
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.core.connection.ack.wait.timeout 3600
#spark.core.connection.auth.wait.timeout        3600
spark.akka.frameSize                    1024
spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                      900
spark.storage.memoryFraction            0.4
spark.rdd.compress                      true
spark.shuffle.blockTransferService      nio
spark.yarn.executor.memoryOverhead      1024

> Executor exit without useful messages when spark runs in spark-streaming
> ------------------------------------------------------------------------
>
>                 Key: SPARK-10145
>                 URL: https://issues.apache.org/jira/browse/SPARK-10145
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming, YARN
>         Environment: Spark 1.3.1, Hadoop 2.6.0, 6 nodes, each with 32 cores and 32g of memory
>            Reporter: Baogang Wang
>            Priority: Critical
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Each node is allocated 30g of memory by YARN.
> My application receives messages from Kafka via a direct stream. Each application consists of 4 DStream windows (a rough sketch of such a job appears after the quoted NodeManager log below).
> The Spark application is submitted with this command:
> spark-submit --class spark_security.safe.SafeSockPuppet --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 --name safeSparkDealerUser --master yarn --deploy-mode cluster spark_Security-1.0-SNAPSHOT.jar.nocalse hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties
> After about 1 hour, some executors exit. There are no more YARN logs after an executor exits, and no stack trace is printed when it exits.
> The YARN NodeManager log shows the following:
> 2015-08-17 17:25:41,550 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_000001 by user root
> 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1439803298368_0005
> 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root	IP=172.19.160.102	OPERATION=Start Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1439803298368_0005	CONTAINERID=container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,551 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from NEW to INITING
> 2015-08-17 17:25:41,552 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_000001 to application application_1439803298368_0005
> 2015-08-17 17:25:41,557 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: rollingMonitorInterval is set as -1. The log rolling mornitoring interval is disabled. The logs will be aggregated after this application is finished.
> 2015-08-17 17:25:41,663 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1439803298368_0005 transitioned from INITING to RUNNING
> 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from NEW to LOCALIZING
> 2015-08-17 17:25:41,664 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005
> 2015-08-17 17:25:41,664 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar transitioned from INIT to DOWNLOADING
> 2015-08-17 17:25:41,665 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1439803298368_0005_01_000001
> 2015-08-17 17:25:41,668 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens. Credentials list:
> 2015-08-17 17:25:41,682 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Initializing user root
> 2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /export/servers/hadoop2.6.0/tmp/nm-local-dir/nmPrivate/container_1439803298368_0005_01_000001.tokens to /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000001.tokens
> 2015-08-17 17:25:41,686 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Localizer CWD set to /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005 = file:/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005
> 2015-08-17 17:25:42,240 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/14/spark-assembly-1.3.1-hadoop2.6.0.jar) transitioned from DOWNLOADING to LOCALIZED
> 2015-08-17 17:25:42,508 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark_Security-1.0-SNAPSHOT.jar(->/export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/filecache/15/spark_Security-1.0-SNAPSHOT.jar) transitioned from DOWNLOADING to LOCALIZED
> 2015-08-17 17:25:42,508 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from LOCALIZING to LOCALIZED
> 2015-08-17 17:25:42,548 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000001 transitioned from LOCALIZED to RUNNING
> ................................................
> 2015-08-17 17:26:20,366 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1439803298368_0005_01_000003 by user root
> 2015-08-17 17:26:20,367 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1439803298368_0005_01_000003 to application application_1439803298368_0005
> 2015-08-17 17:26:20,368 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from NEW to LOCALIZING
> 2015-08-17 17:26:20,368 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1439803298368_0005
> 2015-08-17 17:26:20,368 INFO org.apache.spark.network.yarn.YarnShuffleService: Initializing container container_1439803298368_0005_01_000003
> 2015-08-17 17:26:20,369 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from LOCALIZING to LOCALIZED
> 2015-08-17 17:26:20,370 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root	IP=172.19.160.102	OPERATION=Start Container Request	TARGET=ContainerManageImpl	RESULT=SUCCESS	APPID=application_1439803298368_0005	CONTAINERID=container_1439803298368_0005_01_000003
> 2015-08-17 17:26:20,443 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from LOCALIZED to RUNNING
> 2015-08-17 17:26:20,443 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
> 2015-08-17 17:26:20,449 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003/default_container_executor.sh]
> ..........................................
> 2015-08-18 01:50:30,297 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1439803298368_0005_01_000003 succeeded
> 2015-08-18 01:50:30,440 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from RUNNING to EXITED_WITH_SUCCESS
> 2015-08-18 01:50:30,465 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,046 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root	OPERATION=Container Finished - Succeeded	TARGET=ContainerImpl	RESULT=SUCCESS	APPID=application_1439803298368_0005	CONTAINERID=container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,062 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1439803298368_0005_01_000003 transitioned from EXITED_WITH_SUCCESS to DONE
> 2015-08-18 01:50:35,065 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1439803298368_0005_01_000003 from application application_1439803298368_0005
> 2015-08-18 01:50:35,070 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Neither virutal-memory nor physical-memory monitoring is needed. Not running the monitor-thread
> 2015-08-18 01:50:35,082 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1439803298368_0005_01_000003 for log-aggregation
> 2015-08-18 01:50:35,089 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1439803298368_0005
> 2015-08-18 01:50:35,099 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1439803298368_0005_01_000003
> 2015-08-18 01:50:35,105 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /export/servers/hadoop2.6.0/tmp/nm-local-dir/usercache/root/appcache/application_1439803298368_0005/container_1439803298368_0005_01_000003
> 2015-08-18 01:50:47,601 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1439803298368_0005_01_000001 is : 15
> 2015-08-18 01:50:48,401 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1439803298368_0005_01_000001 and exit code: 15
> ExitCodeException exitCode=15: 
> 	at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
> 	at org.apache.hadoop.util.Shell.run(Shell.java:455)
> 	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
> 	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> container_1439803298368_0005_01_000003 was started at 2015-08-17 17:26:20 and ran normally; it transitioned to EXITED_WITH_SUCCESS at 2015-08-18 01:50:30 and finally received CONTAINER_STOP. container_1439803298368_0005_01_000001 was started at 2015-08-17 17:25:42 and exited suddenly at 2015-08-18 01:50:48 with exit code 15.
> According to the NodeManager log, all we can tell is that container_1439803298368_0005_01_000003 transitioned from RUNNING to EXITED_WITH_SUCCESS.
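
For context, a minimal sketch of the kind of job the report describes (a Kafka direct stream feeding several windowed DStreams, on the Spark 1.3 streaming API). The broker address, topic name, batch interval, and window durations below are illustrative assumptions; only the overall shape (a direct stream plus 4 windows) comes from the report:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DirectStreamWindowSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("safeSparkDealerUser")
        val ssc = new StreamingContext(conf, Seconds(10)) // batch interval: assumption

        // Placeholder broker and topic; the report does not give these.
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val topics = Set("events")

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        // Four windowed views over the same stream, matching the report's
        // "4 DStream windows"; the durations are made up for illustration
        // (each must be a multiple of the 10s batch interval).
        Seq(Minutes(1), Minutes(5), Minutes(10), Minutes(30)).foreach { w =>
          stream.map(_._2).window(w, Seconds(10)).count().print()
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }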



