zookeeper-user mailing list archives

From Yuval Dori <yuval.d...@gigaspaces.com>
Subject IllegalStateException: instance must be started before calling this method
Date Sun, 25 Mar 2018 09:34:13 GMT
Hi,

Our customer is running ZooKeeper 3.4.3 and got an "IllegalStateException: instance must be started before calling this method", with the following stack trace:
2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler]; Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out 
at java.net.SocketInputStream.socketRead0(Native Method) 
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
at java.net.SocketInputStream.read(SocketInputStream.java:171) 
at java.net.SocketInputStream.read(SocketInputStream.java:141) 
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
at java.io.DataInputStream.readInt(DataInputStream.java:387) 
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) 
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) 
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) 
at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542) 
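
If I read it correctly, the trace above is the server-side read timeout in the leader's LearnerHandler for one of the followers. The IllegalStateException itself, on the other hand, is thrown on the client side by Curator, which checks that the CuratorFramework is in STARTED state before every operation. A minimal sketch that reproduces the same message (illustrative only, not the customer's code):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class IllegalStateRepro {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "localhost:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();
            client.close();
            // Any operation issued before start() or after close() fails the
            // internal state check and throws:
            //   java.lang.IllegalStateException: instance must be started before calling this method
            client.delete().forPath("/some/path");
        }
    }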


Here is a description of the scenario:

At first, ps-main.1 is started as the primary on WSOSTSLXBMS01 GSC 3.

24.01.2018 11:31:04,947 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/1
Space instance is Primary 

After some relocation of the ps-main instances, a new backup instance is started on the same GSC.


24.01.2018 11:32:59,886 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T146	SPACE/GS-LRMI Connection-pool-1-thread-6/1
Space instance [ frqMain_container1:frqMain ] has been selected as Backup 

Later, a failover test is performed by shutting down WSOSTSLXBMS02 and WSOSTSLXMSS02.
We would expect the ps-main.1 backup to be elected as primary, because the former primary
was located on WSOSTSLXBMS02, but no election happened. In the end there was no primary
of ps-main.1 running anywhere in the cluster.
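
For context, primary election for the space instances is, as far as I understand, driven by a Curator LeaderSelector. A minimal sketch of the pattern (class name and election path are placeholders, not our actual code):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.recipes.leader.LeaderSelector;
    import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;

    public class PrimaryElector extends LeaderSelectorListenerAdapter {
        private final LeaderSelector selector;

        public PrimaryElector(CuratorFramework client, String electionPath) {
            selector = new LeaderSelector(client, electionPath, this);
            // Without autoRequeue() (or an explicit requeue()), an instance
            // that loses or relinquishes leadership never becomes a candidate
            // again, so it can never be elected primary later on.
            selector.autoRequeue();
        }

        public void start() {
            selector.start();
        }

        @Override
        public void takeLeadership(CuratorFramework client) throws Exception {
            // This instance is now primary; hold leadership until interrupted
            // (the adapter cancels leadership on SUSPENDED/LOST connection states).
            Thread.currentThread().join();
        }
    }

What we would like to understand is whether the exception below can leave the selector in a state where it never re-queues, so that no further election takes place.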

We found a suspicious exception in the logs:
24.01.2018 11:32:44,987 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain
Lost leadership [sessionId=0x36128253b820001] 
24.01.2018 11:32:44,987 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T98	SPACE/Curator-Framework-0/CuratorFrameworkImpl
backgroundOperationsLoop exiting 
24.01.2018 11:32:44,990 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/1
Quiesce state set to SUSPENDED 
24.01.2018 11:32:44,991 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T24	SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper
Session: 0x36128253b820001 closed 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
The leader threw an exception 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
java.lang.IllegalStateException: instance must be started before calling this method 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.shaded.com.google.common.base.Preconditions.checkState(Preconditions.java:176)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.imps.CuratorFrameworkImpl.delete(CuratorFrameworkImpl.java:359)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.locks.LockInternals.deleteOurPath(LockInternals.java:339)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.locks.LockInternals.releaseLock(LockInternals.java:123)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.locks.InterProcessMutex.release(InterProcessMutex.java:154)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:449)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:466)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:65)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:246)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at org.apache.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:240)

24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector
at java.lang.Thread.run(Thread.java:748) 
24.01.2018 11:32:44,991 +01:00	ERROR	WSOSTSLXBMS01/GSC/3	P1280/T120	SPACE/Curator-LeaderSelector-0/LeaderSelector

24.01.2018 11:32:44,991 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T97	SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn
EventThread shut down for session: 0x36128253b820001 
24.01.2018 11:32:44,997 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T24	SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1
Channel is closed [target=frqMain1_1, target url=jini://*/frqMain_container1_1/frqMain?groups=POLBAY&ignoreValidation=true&schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_2538032931939568226_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(frqMain_container1_1:frqMain)]]

24.01.2018 11:32:44,998 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T24	SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service
Channel is closed [target=ps-main-mirror-service, target url=jini://*/ps-main-mirror-service_container/ps-main-mirror-service?schema=persistent&id=1&total_members=4,1&versioned=true&mirror=true&groups=POLBAY&cluster_schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine connection url=NIO://WSOSTSLXBMS01:11002/pid[3220]/25906827215_3_7196393537085178059_details[class com.gigaspaces.internal.cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main-mirror-service_container:ps-main-mirror-service)]]

24.01.2018 11:32:45,767 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T24	SPACE/GS-LRMI Connection-pool-1-thread-1/1
Shutdown complete 
24.01.2018 11:32:45,768 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T24	SPACE/GS-LRMI Connection-pool-1-thread-1/container
Container <frqMain_container1> shutdown completed 

24.01.2018 11:32:46,037 +01:00	WARN	WSOSTSLXBMS01/GSC/3	P1280/T24	SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl
Failed to delete deployed processing unit from [C:\Program Files\Frequentis\Frequentis LifeX
Platform\gigaspaces\work\processing-units\ps-main_1_2008048111] 
24.01.2018 11:32:46,037 +01:00	INFO	WSOSTSLXBMS01/GSC/3	P1280/T24	SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl
Stopped 
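
Reading the timestamps, the ZooKeeper session / CuratorFramework is closed on thread T24 at 11:32:44,991 while the LeaderSelector on thread T120 is still releasing its lock node, so its delete() hits the STARTED check and throws. If that reading is correct, I assume the safe shutdown order is to close the recipe before the framework, roughly like this (sketch only, not our actual shutdown code):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.recipes.leader.LeaderSelector;
    import org.apache.curator.utils.CloseableUtils;

    class ShutdownOrder {
        static void shutdown(LeaderSelector selector, CuratorFramework client) {
            // Close the recipe first so it can still release its lock node
            // while the client is in STARTED state ...
            CloseableUtils.closeQuietly(selector);
            // ... and only then close the CuratorFramework / ZooKeeper session.
            CloseableUtils.closeQuietly(client);
        }
    }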

Thread dump attached. Failover happened around:

24.01.2018 11:40:20,893 +01:00	WARN	WSOSTSLXBMS01/GSC/3	P1280/T59	SPACE/GS-Watchdog/watchdog
The ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001] is not reachable or is reachable but with no matching invocation in progress at the server peer, closing invalid connection with local address [/10.17.46.136:49864] [remote invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net.SocketTimeoutException]


Could this be a ZK bug?

Thanks,

Yuval



