spark-user mailing list archives

From "刚" <949294...@qq.com>
Subject How can I solve the problem that the executors of my Spark application are always blocked after an executor is lost?
Date Mon, 17 Aug 2015 03:35:46 GMT
Hi guys:
    I run 9 applications in my Spark cluster at the same time. They all run well in the beginning,
but after several hours some applications lose one executor and the other executors become blocked.
By the way, I am using Spark Streaming to analyze real-time messages. The screenshots are as follows.

Figure 1: The stage has lasted for a long time after one executor is lost

Figure 2: The task info of the stage that has lasted for a long time after one executor is lost

The command I use to submit an application is as follows:
spark-submit --class spark_security.login_users.Sockpuppet --driver-memory 3g --executor-memory 3g --num-executors 3 --executor-cores 4 --name pcLoginSparkDealerUser --master yarn --deploy-mode cluster spark_Security-1.0-SNAPSHOT.jar hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/logindelaer.properties

     
The other 8 applications are submitted with the same driver-memory, executor-memory, num-executors,
and executor-cores, and they all run in cluster mode.
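
For reference, here is my rough estimate of what the 9 applications request from YARN in total
(my own back-of-the-envelope numbers: they ignore the per-container memory overhead that YARN adds
on top, and assume the default of 1 driver core in cluster mode):

val appsRunning      = 9
val executorsPerApp  = 3
val executorMemoryGB = 3
val executorCores    = 4
val driverMemoryGB   = 3
val driverCores      = 1   // default when --driver-cores is not set

val memoryPerAppGB = driverMemoryGB + executorsPerApp * executorMemoryGB   // 3 + 3*3 = 12 GB
val coresPerApp    = driverCores + executorsPerApp * executorCores         // 1 + 3*4 = 13 cores

val totalMemoryGB  = appsRunning * memoryPerAppGB                          // 9 * 12 = 108 GB (plus YARN overhead)
val totalCores     = appsRunning * coresPerApp                             // 9 * 13 = 117 cores

So these 9 jobs alone need roughly 108 GB of container memory and 117 cores of YARN capacity.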


When the problem happens, I get the YARN logs with the following command:


yarn logs -applicationId application_1439457182724_0026


I cannot find any exception stack trace, but I do find the following information:
15/08/17 00:32:53 INFO streaming.CheckpointWriter: Saving checkpoint for time 1439472653000
ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000'
15/08/17  00:32:53 INFO streaming.CheckpointWriter: Deleting hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/safemodpasswd/checkpoint/checkpoint-1439472643000
15/08/17  00:32:53 INFO streaming.CheckpointWriter: Checkpoint for time 1439472653000 ms saved
to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472653000',
took 473939 bytes and 65 ms
15/08/17  00:32:53 INFO transport.ProtocolStateActor: No response from remote. Handshake timed
out or transport failure detector triggered.
15/08/17  00:32:53 ERROR cluster.YarnClusterScheduler: Lost executor 5 on A01-R08-2-I160-115.JD.LOCAL:
remote Akka client disassociated
15/08/17  00:32:53 WARN remote.ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkExecutor@A01-R08-2-I160-115.JD.LOCAL:48922] has failed, address is now gated
for [5000] ms. Reason is: [Disassociated].
15/08/17 00:32:54 INFO scheduler.TaskSetManager: Re-queueing tasks for 3 from TaskSet 3719.0
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 15000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 45000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 60000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO dstream.FilteredDStream: Time 1439472654000 ms is invalid as zeroTime
is 1439457657000 ms and slideDuration is 120000 ms and difference is 14997000 ms
15/08/17 00:32:54 INFO scheduler.JobScheduler: Added jobs for time 1439472654000 ms
15/08/17 00:32:54 INFO scheduler.JobGenerator: Checkpointing graph for time 1439472654000
ms
15/08/17 00:32:54 INFO streaming.DStreamGraph: Updating checkpoint data for time 1439472654000
ms
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 11.0 in stage 3719.0 (TID 707634,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 INFO streaming.DStreamGraph: Updated checkpoint data for time 1439472654000
ms
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 3719.0 (TID 707625,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 3719.0 (TID 707628,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17 00:32:54 WARN scheduler.TaskSetManager: Lost task 8.0 in stage 3719.0 (TID 707631,
A01-R08-2-I160-115.JD.LOCAL): ExecutorLostFailure (executor 5 lost)
15/08/17  00:32:54 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 930)
15/08/17  00:32:54 INFO storage.BlockManagerMasterActor: Trying to remove executor 3 from
BlockManagerMaster.
15/08/17  00:32:54 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
15/08/17  00:32:54 INFO scheduler.Stage: Stage 3718 is now unavailable on executor 3 (111/180,
false)
15/08/17  00:32:54 INFO streaming.CheckpointWriter: Saving checkpoint for time 1439472654000
ms to file 'hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/regeditCountSparkDealerUser/checkpoint/checkpoint-1439472654000'


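For context, each of the 9 jobs is set up roughly like the sketch below (simplified from my real
code: the actual class reads its settings from the properties file and the business logic is
omitted, but the 15-second batch interval, the windowed counts and the HDFS checkpointing match
what the logs above show):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object Sockpuppet {
  def main(args: Array[String]): Unit = {
    // In the real job this path is read from the properties file given on the command line
    val checkpointDir = args(0)

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("pcLoginSparkDealerUser")
      val ssc  = new StreamingContext(conf, Seconds(15))   // 15 s batch interval, as in the logs
      ssc.checkpoint(checkpointDir)                        // checkpoints go to HDFS
      // ... build the input DStream and the 45 s / 60 s / 120 s windowed counts here ...
      ssc
    }

    // Recover from an existing checkpoint if there is one, otherwise create a new context
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}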

Someone said it might be caused by an OOM, but I cannot find any OOM stack trace in the logs.


I set spark-defaults.conf as follows:
spark.core.connection.ack.wait.timeout  3600
spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize                    1024
spark.driver.extraJavaOptions           -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions          -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                      900
spark.storage.memoryFraction            0.4
spark.rdd.compress      
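
To check the OOM theory, I am also thinking about adding the line below, so that an executor OOM
would at least leave a heap dump and GC log behind (the dump path is just an example; I only use
spark.executor.extraJavaOptions here to pass extra JVM flags to the executors):

spark.executor.extraJavaOptions         -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -verbose:gc -XX:+PrintGCDetails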



It would be greatly appreciated if anyone could tell me how to solve this problem. It has bothered
me for a long time.