zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuval Dori <yuval.d...@gigaspaces.com>
Subject Re: IllegalStateException: instance must be started before calling this method
Date Mon, 26 Mar 2018 11:13:35 GMT
Hi Martin,

Not in a Docker. We are using the word container for one of our services.

Thanks,

Yuval

On Sun, Mar 25, 2018 at 11:57 PM, Martin Gainty <mgainty@hotmail.com> wrote:

>
>
> in docker  container?
>
> if so how do you resolve hosts (.hosts file, kubernetes, swarm)?
>
> M
> ________________________________
> From: Yuval Dori <yuval.dori@gigaspaces.com>
> Sent: Sunday, March 25, 2018 5:34 AM
> To: user@zookeeper.apache.org
> Cc: Yuval Dori
> Subject: IllegalStateException: instance must be started before calling
> this method
>
> Hi,
>
> Our customer is using ZK 3.4.3 and got IllegalStateException: instance
> must be started before calling this method
>
>
> with the following stack trace:
> 2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194;
> [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler<ht
> tp://10.17.46.142:49336/LearnerHandler>]; Unexpected exception causing
> shutdown while sock still open
> java.net<http://java.net>.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:171)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
> at java.io.DataInputStream.readInt(DataInputStream.java:387)
> at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> at org.apache.zookeeper.server.quorum.QuorumPacket.
> deserialize(QuorumPacket.java:83)
> at org.apache.jute.BinaryInputArchive.readRecord(
> BinaryInputArchive.java:99)
> at org.apache.zookeeper.server.quorum.LearnerHandler.run(
> LearnerHandler.java:542)
>
>
> This is the description:
>
> At first ps-main.1 is started as primary on WSOSTSLXBMS01 GSC 3.
>
> 24.01.2018 11:31:04,947 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/1 Space instance is Primary
>
> After some relocation of ps-main instances a new backup instance is
> started on the same GSC.
>
> 24.01.2018 11:32:59,886 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T146
> SPACE/GS-LRMI Connection-pool-1-thread-6/1 Space instance [
> frqMain_container1:frqMain ] has been selected as Backup
>
> Later a failover test is performed by shutting down WSOSTSLXBMS02 and
> WSOSTSLXMSS02.
> We would expect that ps-main.1 backup will be elected as primary because
> the former primary was located at WSOSTSLXBMS02 but no election happend. In
> the end in the whole cluster there was no primary of ps-main.1 running.
>
> We found suspicious exception in the logs:
> 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/frqMain_container1:frqMain Lost leadership
> [sessionId=0x36128253b820001]
> 24.01.2018 11:32:44,987 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T98
> SPACE/Curator-Framework-0/CuratorFrameworkImpl backgroundOperationsLoop
> exiting
> 24.01.2018 11:32:44,990 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/1 Quiesce state set to SUSPENDED
> 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24
> SPACE/GS-LRMI Connection-pool-1-thread-1/ZooKeeper Session:
> 0x36128253b820001 closed
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector The leader threw an
> exception
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector java.lang.IllegalStateException:
> instance must be started before calling this method
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.shaded.com<http://org.apache.curator.shaded.com
> >.google.common.base.Preconditions.checkState(Preconditions.java:176)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.imps.CuratorFrameworkImpl.
> delete(CuratorFrameworkImpl.java:359)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.locks.LockInternals.
> deleteOurPath(LockInternals.java:339)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.locks.LockInternals.
> releaseLock(LockInternals.java:123)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.locks.InterProcessMutex.release(
> InterProcessMutex.java:154)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.leader.LeaderSelector.
> doWork(LeaderSelector.java:449)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.leader.LeaderSelector.
> doWorkLoop(LeaderSelector.java:466)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.leader.LeaderSelector.
> access$100(LeaderSelector.java:65)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.leader.LeaderSelector$
> 2.call(LeaderSelector.java:246)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> org.apache.curator.framework.recipes.leader.LeaderSelector$
> 2.call(LeaderSelector.java:240)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.
> FutureTask.run(FutureTask.java:266)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.
> Executors$RunnableAdapter.call(Executors.java:511)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.
> FutureTask.run(FutureTask.java:266)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.
> ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at java.util.concurrent.
> ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector at
> java.lang.Thread.run(Thread.java:748)
> 24.01.2018 11:32:44,991 +01:00 ERROR WSOSTSLXBMS01/GSC/3 P1280/T120
> SPACE/Curator-LeaderSelector-0/LeaderSelector
> 24.01.2018 11:32:44,991 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T97
> SPACE/GS-LRMI Connection-pool-1-thread-1-EventThread/ClientCnxn
> EventThread shut down for session: 0x36128253b820001
> 24.01.2018 11:32:44,997 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24
> SPACE/GS-LRMI Connection-pool-1-thread-1/frqMain1_1 Channel is closed
> [target=frqMain1_1, target url=jini://*/frqMain_
> container1_1/frqMain?groups=POLBAY&ignoreValidation=true&
> schema=persistent&id=1&total_members=4,1&versioned=true&
> mirror=true&cluster_schema=partitioned&locators=
> WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000, target machine
> connection url=NIO://WSOSTSLXBMS02:11002/pid[3756]/24795468496_3_
> 2538032931939568226_details[class com.gigaspaces.internal.
> cluster.node.impl.router.AbstractConnectionProxyBasedReplicationRouter$
> ConnectionEndpoint(frqMain_container1_1:frqMain)]]
> 24.01.2018 11:32:44,998 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24
> SPACE/GS-LRMI Connection-pool-1-thread-1/ps-main-mirror-service Channel
> is closed [target=ps-main-mirror-service, target
> url=jini://*/ps-main-mirror-service_container/ps-main-
> mirror-service?schema=persistent&id=1&total_members=
> 4,1&versioned=true&mirror=true&groups=POLBAY&cluster_
> schema=partitioned&locators=WSOSTSLXMSS01,WSOSTSLXMSS02&state=started&timeout=5000,
> target machine connection url=NIO://WSOSTSLXBMS01:11002/
> pid[3220]/25906827215_3_7196393537085178059_details[class
> com.gigaspaces.internal.cluster.node.impl.router.
> AbstractConnectionProxyBasedReplicationRouter$ConnectionEndpoint(ps-main-
> mirror-service_container:ps-main-mirror-service)]]
> 24.01.2018 11:32:45,767 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24
> SPACE/GS-LRMI Connection-pool-1-thread-1/1 Shutdown complete
> 24.01.2018 11:32:45,768 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24
> SPACE/GS-LRMI Connection-pool-1-thread-1/container Container
> <frqMain_container1> shutdown completed
>
> 24.01.2018 11:32:46,037 +01:00 WARN WSOSTSLXBMS01/GSC/3 P1280/T24
> SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Failed to
> delete deployed processing unit from [C:\Program
> Files\Frequentis\Frequentis LifeX Platform\gigaspaces\work\
> processing-units\ps-main_1_2008048111]
> 24.01.2018 11:32:46,037 +01:00 INFO WSOSTSLXBMS01/GSC/3 P1280/T24
> SPACE/GS-LRMI Connection-pool-1-thread-1/PUServiceBeanImpl Stopped
>
> Attached thread dump. Failover happend around 24.01.2018 11:40:20,893
> +01:00
> WARN WSOSTSLXBMS01/GSC/3 P1280/T59 SPACE/GS-Watchdog/watchdog The
> ServerEndPoint [WSOSTSLXMSS02/10.17.46.142:11001<http://10.17.46.142:
> 11001/>] is not reachable or is reachable but with no matching invocation
> in progress at the server peer, closing invalid connection with local
> address [/10.17.46.136:49864<http://10.17.46.136:49864/>] [ remote
> invocation of: net.jini.lookup.DiscoveryAdmin.getMemberGroups] [java.net<
> http://java.net>.SocketTimeoutException]
>
>
> Could it be ZK bug?
>
> Thanks,
>
> Yuval
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message