spark-user mailing list archives

From ZhuGe <t...@outlook.com>
Subject master die and worker registration failed with duplicated worker id
Date Mon, 19 Oct 2015 07:43:17 GMT
Hi all:

We have run into a series of weird problems in our standalone cluster with 2 masters (ZK election agent).

Q1: First, we find that the active master loses leadership at some point and shuts itself down.
[INFO 2015-10-17 13:00:15 (ClientCnxn.java:1083)] Client session timed out, have not heard from server in 27132ms for sessionid 0x14fef0664190018, closing socket connection and attempting reconnect
[INFO 2015-10-17 13:00:15 (ConnectionStateManager.java:194)] State change: SUSPENDED
[INFO 2015-10-17 13:00:15 (Logging.scala:59)] We have lost leadership
[ERROR 2015-10-17 13:00:15 (Logging.scala:75)] Leadership has been revoked -- master shutting down.
[INFO 2015-10-17 13:00:15 (Logging.scala:59)] Shutdown hook called
From the log, it seems that the ZK session timed out and the ZK election agent then revoked the master's leadership. We suspected a long full GC had hung the process, so we monitored GC and found that after 3+ days of running there had been 200+ minor GCs and no full GC. We also dumped the heap and found that JobProgressListener instances consume a lot of memory. Is it a bug or some configuration problem?
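For reference, this is roughly how we enabled GC logging on the master daemon; SPARK_DAEMON_JAVA_OPTS is the standard hook for daemon JVM options, and the log path is just a placeholder from our layout:

# conf/spark-env.sh on the master host (log path is only an example)
export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/data/dbcenter/logs/spark-master-gc.log"

The heap dump summary from the master process follows: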
123 instances of "org.apache.spark.ui.jobs.JobProgressListener", loaded by "sun.misc.Launcher$AppClassLoader @ 0xbffa2420", occupy 338.25 MB (86.70%) bytes.

Biggest instances:
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc5e4de80 - 18.42 MB (4.72%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc7298f70 - 18.42 MB (4.72%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc870d960 - 18.42 MB (4.72%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc498bc50 - 18.42 MB (4.72%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc09a69b0 - 18.42 MB (4.72%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xca84dbc0 - 14.57 MB (3.74%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc2850b10 - 14.53 MB (3.73%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd7118058 - 14.50 MB (3.72%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd02fbba8 - 14.41 MB (3.69%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd3b81ef0 - 12.07 MB (3.09%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd90546d0 - 11.43 MB (2.93%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd1935b30 - 11.43 MB (2.93%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd9d50e48 - 6.90 MB (1.77%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc3971b90 - 6.90 MB (1.77%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xcc78c918 - 6.90 MB (1.77%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd27d6ce0 - 6.90 MB (1.77%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xcbdfa0a0 - 6.32 MB (1.62%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xd6736f90 - 4.52 MB (1.16%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xdc8be4a0 - 4.51 MB (1.16%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xdbca8e78 - 4.43 MB (1.13%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xca0c5d78 - 4.42 MB (1.13%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xc2101c40 - 4.41 MB (1.13%) bytes.
• org.apache.spark.ui.jobs.JobProgressListener @ 0xcfa67e38 - 4.41 MB (1.13%) bytes.

These instances are referenced from one instance of "org.spark-project.jetty.server.Handler[]", loaded by "sun.misc.Launcher$AppClassLoader @ 0xbffa2420".

Keywords: org.apache.spark.ui.jobs.JobProgressListener, org.spark-project.jetty.server.Handler[], sun.misc.Launcher$AppClassLoader @ 0xbffa2420
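In case the root cause is simply UI/application retention on the master, we were considering lowering the retention settings instead of patching anything. The property names are from the standard Spark configuration; the values below are only guesses on our side and not yet tested:

# master side, in conf/spark-env.sh (defaults are 200 for both)
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -Dspark.deploy.retainedApplications=50 -Dspark.deploy.retainedDrivers=50"

# application side, in conf/spark-defaults.conf (defaults are 1000 each)
spark.ui.retainedJobs     200
spark.ui.retainedStages   200

Would that be expected to bound the JobProgressListener memory on the master, or is it unrelated?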



Q2: When the active master dies and the backup master gets leadership, another problem pops up: some workers fail to register with the new master. As a consequence, the new master loses some workers. The worker log is shown below:

[INFO 2015-10-19 15:01:45 (Logging.scala:59)] Master has changed, new master is at spark://10.12.201.163:7071
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Master with url spark://10.12.201.162:7071 requested this worker to reconnect.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Connecting to master akka.tcp://sparkMaster@10.12.201.162:7071/user/Master...
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Master with url spark://10.12.201.162:7071 requested this worker to reconnect.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Not spawning another attempt to register with the master, since there is an attempt scheduled already.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Master with url spark://10.12.201.162:7071 requested this worker to reconnect.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Not spawning another attempt to register with the master, since there is an attempt scheduled already.
[WARN 2015-10-19 15:02:57 (Slf4jLogger.scala:71)] Association with remote system [akka.tcp://sparkMaster@10.12.201.162:7071] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
[INFO 2015-10-19 15:03:01 (Logging.scala:59)] Asked to launch executor app-20151019150302-0000/8 for DirectKafkaStreaming
[INFO 2015-10-19 15:03:01 (Logging.scala:59)] Launch command: "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" "/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.akka.frameSize=100" "-Dspark.driver.port=12077" "-XX:MaxPermSize=128m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver@10.12.201.183:12077/user/CoarseGrainedScheduler" "--executor-id" "8" "--hostname" "10.12.201.166" "--cores" "1" "--app-id" "app-20151019150302-0000" "--worker-url" "akka.tcp://sparkWorker@10.12.201.166:30397/user/Worker"
[INFO 2015-10-19 15:03:03 (Logging.scala:59)] Retrying connection to master (attempt # 1)
[ERROR 2015-10-19 15:03:03 (Logging.scala:75)] Worker registration failed: Duplicate worker ID
[INFO 2015-10-19 15:03:03 (Logging.scala:59)] Killing process!
[INFO 2015-10-19 15:03:03 (Logging.scala:59)] Killing process!
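For now, when a worker hits "Duplicate worker ID" we simply restart that worker daemon by hand so it comes back with a fresh worker ID. Roughly like this (the instance number, port and master URLs are from our setup, and we have not confirmed this is the proper recovery):

cd /data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4
# stop the stuck worker daemon (instance 1 on this host)
sbin/spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
# start it again against both masters so it registers with a new worker ID
sbin/spark-daemon.sh start org.apache.spark.deploy.worker.Worker 1 spark://10.12.201.162:7071,10.12.201.163:7071

Is there a better way to make the worker recover automatically after the failover?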


Any help would be appreciated!
Cheers