Hi all:
We have run into a series of weird problems in our standalone cluster with two masters (ZooKeeper election agent).

Q1:
First, we found that the active master would lose leadership at some point and shut itself down.

[INFO 2015-10-17 13:00:15 (ClientCnxn.java:1083)] Client session timed out, have not heard from server in 27132ms for sessionid 0x14fef0664190018, closing socket connection and attempting reconnect
[INFO 2015-10-17 13:00:15 (ConnectionStateManager.java:194)] State change: SUSPENDED
[INFO 2015-10-17 13:00:15 (Logging.scala:59)] We have lost leadership
[ERROR 2015-10-17 13:00:15 (Logging.scala:75)] Leadership has been revoked -- master shutting down.
[INFO 2015-10-17 13:00:15 (Logging.scala:59)] Shutdown hook called

From the log, it seems the ZooKeeper session timed out, after which the ZooKeeper election agent revoked the master's leadership.
We suspected a long full GC had hung the process, so we monitored GC and found that after 3+ days of uptime there had been 200+ minor GCs and no full GC. We then dumped the heap and found that JobProgressListener instances consume a lot of memory.
Is this a bug or a configuration problem?

123 instances of "org.apache.spark.ui.jobs.JobProgressListener", loaded by "sun.misc.Launcher$AppClassLoader @ 0xbffa2420" occupy 338.25 MB (86.70%) bytes. 

Biggest instances:

* org.apache.spark.ui.jobs.JobProgressListener @ 0xc5e4de80 - 18.42 MB (4.72%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xc7298f70 - 18.42 MB (4.72%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xc870d960 - 18.42 MB (4.72%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xc498bc50 - 18.42 MB (4.72%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xc09a69b0 - 18.42 MB (4.72%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xca84dbc0 - 14.57 MB (3.74%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xc2850b10 - 14.53 MB (3.73%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd7118058 - 14.50 MB (3.72%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd02fbba8 - 14.41 MB (3.69%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd3b81ef0 - 12.07 MB (3.09%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd90546d0 - 11.43 MB (2.93%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd1935b30 - 11.43 MB (2.93%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd9d50e48 - 6.90 MB (1.77%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xc3971b90 - 6.90 MB (1.77%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xcc78c918 - 6.90 MB (1.77%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd27d6ce0 - 6.90 MB (1.77%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xcbdfa0a0 - 6.32 MB (1.62%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xd6736f90 - 4.52 MB (1.16%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xdc8be4a0 - 4.51 MB (1.16%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xdbca8e78 - 4.43 MB (1.13%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xca0c5d78 - 4.42 MB (1.13%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xc2101c40 - 4.41 MB (1.13%) bytes.
* org.apache.spark.ui.jobs.JobProgressListener @ 0xcfa67e38 - 4.41 MB (1.13%) bytes.
These instances are referenced from one instance of "org.spark-project.jetty.server.Handler[]", loaded by "sun.misc.Launcher$AppClassLoader @ 0xbffa2420"

Keywords
org.apache.spark.ui.jobs.JobProgressListener
org.spark-project.jetty.server.Handler[]
sun.misc.Launcher$AppClassLoader @ 0xbffa2420
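For reference, we are guessing that each completed application whose UI the master keeps around holds its own JobProgressListener, so we are considering capping how much history the master and UIs retain. A sketch of the settings involved (the values below are illustrative guesses, not tested recommendations):

```
# conf/spark-defaults.conf -- illustrative values only
# Cap how many completed applications the standalone master retains
# (each retained application's rebuilt UI appears to hold a JobProgressListener):
spark.deploy.retainedApplications  50
# Cap per-UI job/stage history so each listener stays small
# (defaults are 1000 in Spark 1.4):
spark.ui.retainedJobs    200
spark.ui.retainedStages  200
```

Would lowering these be the right way to bound the master's heap, or is the listener buildup itself a bug?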




Q2:
When the active master died and the backup master gained leadership, another problem popped up: some workers failed to register with the new master. As a consequence, the new master lost some workers.
Logs shown below:


[INFO 2015-10-19 15:01:45 (Logging.scala:59)] Master has changed, new master is at spark://10.12.201.163:7071
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Master with url spark://10.12.201.162:7071 requested this worker to reconnect.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Connecting to master akka.tcp://sparkMaster@10.12.201.162:7071/user/Master...
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Master with url spark://10.12.201.162:7071 requested this worker to reconnect.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Not spawning another attempt to register with the master, since there is an attempt scheduled already.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Master with url spark://10.12.201.162:7071 requested this worker to reconnect.
[INFO 2015-10-19 15:02:57 (Logging.scala:59)] Not spawning another attempt to register with the master, since there is an attempt scheduled already.
[WARN 2015-10-19 15:02:57 (Slf4jLogger.scala:71)] Association with remote system [akka.tcp://sparkMaster@10.12.201.162:7071] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
[INFO 2015-10-19 15:03:01 (Logging.scala:59)] Asked to launch executor app-20151019150302-0000/8 for DirectKafkaStreaming
[INFO 2015-10-19 15:03:01 (Logging.scala:59)] Launch command: "/data/dbcenter/jdk1.7.0_79/bin/java" "-cp" "/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/postgresql-9.2-1004-jdbc41.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/hive-contrib-0.13.1-cdh5.2.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/sbin/../conf/:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/spark-assembly-1.4.0-hadoop2.4.0.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data/dbcenter/cdh5/spark-1.4.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar" "-Xms1024M" "-Xmx1024M" "-Dspark.akka.frameSize=100" "-Dspark.driver.port=12077" "-XX:MaxPermSize=128m" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "akka.tcp://sparkDriver@10.12.201.183:12077/user/CoarseGrainedScheduler" "--executor-id" "8" "--hostname" "10.12.201.166" "--cores" "1" "--app-id" "app-20151019150302-0000" "--worker-url" "akka.tcp://sparkWorker@10.12.201.166:30397/user/Worker"
[INFO 2015-10-19 15:03:03 (Logging.scala:59)] Retrying connection to master (attempt # 1)
[ERROR 2015-10-19 15:03:03 (Logging.scala:75)] Worker registration failed: Duplicate worker ID
[INFO 2015-10-19 15:03:03 (Logging.scala:59)] Killing process!
[INFO 2015-10-19 15:03:03 (Logging.scala:59)] Killing process!
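From the log, the worker keeps trying to reconnect to the old master (10.12.201.162) even after the new one (10.12.201.163) has taken over. We wondered whether the workers need to be pointed at both masters so they can follow a failover. A sketch of what we mean (the ZooKeeper hosts below are placeholders for our own quorum):

```
# conf/spark-env.sh -- illustrative; zk1/zk2/zk3 are placeholder hosts
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"

# However the workers are launched, the master URL lists both masters:
#   spark://10.12.201.162:7071,10.12.201.163:7071
```

Even with that, would the "Duplicate worker ID" error still occur, or does it indicate the new master recovered stale worker state from ZooKeeper?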



Any help would be appreciated!

Cheers