zookeeper-user mailing list archives

From R Krishna <krishna...@gmail.com>
Subject Re: regarding ZooKeeper cluster setup replication, config issues and inconsistent state
Date Fri, 13 May 2016 19:59:36 GMT
As I said before, I cannot even restart one server; it automatically brings
up another process.

I tried explicitly setting the PID:
ps -aef | grep -i zoo
vim /var/lib/zookeeper/zookeeper_server.pid
sudo /usr/share/zookeeper/bin/zkServer.sh restart

Neither restart nor stop works. Is there a setting to shut down ZooKeeper and
bring the servers up one by one in a 3-node cluster?
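
For reference, here is a rough sketch in Python of how a one-by-one restart
could be sequenced. It assumes each server answers the `srvr` four-letter
command on the client port (2181 in the config quoted below) and follows the
common convention of restarting followers first and the leader last. The
function names (four_letter, parse_mode, restart_order) are hypothetical
helpers written for this illustration, not part of ZooKeeper.

```python
import socket


def four_letter(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter command (e.g. 'srvr') and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Return the value of the 'Mode:' line (leader, follower, standalone) or None."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None  # no Mode line, e.g. the server is not serving requests


def restart_order(modes):
    """Given {host: mode}, list hosts with non-leaders first and the leader last."""
    return sorted(modes, key=lambda host: modes[host] == "leader")
```

A calling loop would query each host with four_letter(host, 2181, "srvr"),
restart one host at a time with zkServer.sh in the order restart_order gives,
and wait until parse_mode reports leader or follower for that host again
before moving to the next one.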


On Fri, May 13, 2016 at 12:57 PM, R Krishna <krishna81m@gmail.com> wrote:

> I have a fairly simple config file (below). I tried rebooting the machine,
> but server 75 never restarts properly: it never opens a LISTEN socket on
> port 3888, so the other servers obviously get 2016-05-13 12:54:58,555 - WARN
> [WorkerSender[myid=3]:QuorumCnxManager@368] - Cannot open channel to 1 at
> election address /172.28.84.75:3888. Meanwhile, 75 is unable to expose 3888
> and unable to connect to the other servers, with the exceptions shown before.
>
> Yes, I chose a distinct id (1 to 3) for each server. How do you do a rolling
> restart? And where do you configure a server to tolerate not finding all
> the other servers?
>
> # The number of milliseconds of each tick
> tickTime=2000
> # The number of ticks that the initial
> # synchronization phase can take
> initLimit=10
> # The number of ticks that can pass between
> # sending a request and getting an acknowledgement
> syncLimit=5
> # the directory where the snapshot is stored.
> dataDir=/var/lib/zookeeper
> # Place the dataLogDir to a separate physical disc for better performance
> # dataLogDir=/disk2/zookeeper
>
> # the port at which the clients will connect
> clientPort=2181
>
> # specify all zookeeper servers
> # The first port is used by followers to connect to the leader
> # The second one is used for leader election
> server.1=X.Y.Z.75:2888:3888
> server.2=X.Y.Z.76:2888:3888
> server.3=X.Y.Z.98:2888:3888
>
>
> On Fri, May 13, 2016 at 3:51 AM, Flavio Junqueira <fpj@apache.org> wrote:
>
>> Hi there,
>>
>> The myid needs to contain the id for each server in the ensemble, so each
>> server will have a distinct value in its myid file.
>>
>> The problem might be with your configuration file. I think you say that
>> you have specified the servers in the config file of each server, but
>> perhaps you want to have a look at the documentation to see if there is
>> anything you're missing. If you're not sure, please post it here.
>>
>> In the 3.4 branch of ZK, you have to do a rolling upgrade of the servers.
>>
>> -Flavio
>>
>> > On 13 May 2016, at 11:15, R Krishna <krishna81m@gmail.com> wrote:
>> >
>> > Just tried to set up a 2-node ZooKeeper cluster for the first time, one
>> > node for each broker of my 2-broker Kafka cluster, and came across the
>> > following issues:
>> > 1. Do we have to specify a separate value in vim /var/lib/zookeeper/myid
>> > although they are separate machine instances?
>> > 2. I kept seeing Mode: standalone on the two servers although I saw
>> > connectivity between these two. After restarts, I saw them go to
>> > Follower/Leader.
>> > /usr/share/zookeeper/bin/zkServer.sh status
>> >    JMX enabled by default
>> >    Using config: /etc/zookeeper/conf/zoo.cfg
>> >    Mode: standalone
>> > 3. The data was completely inconsistent. I was able to connect to each one
>> > and run all the netcat status commands from the other server without any
>> > issue. However, Kafka broker data was inconsistent and kept failing. Is
>> > there a way to confirm that both nodes are in sync and part of the same
>> > cluster?
>> > org.I0Itec.zkclient.exception.ZkNoNodeException:
>> > org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> > NoNode for /config/changes
>> >
>> > 4. Whenever I update the .cfg file, I cannot do a sudo
>> > /usr/share/zookeeper/bin/zkServer.sh restart; I have to force kill the pid,
>> > in which case it brings up another process reading the latest .cfg. Why is
>> > this so?
>> >
>> > 5. I realized we need at least 3 to make an ensemble, so I created and
>> > added another ZK host, updated the .cfg, and force killed the process so it
>> > reads the latest config, and then started getting these exceptions. Yes,
>> > this probably means I have run out of connections.
>> >
>> > *And finally, how do I safely restart such a cluster when adding new nodes
>> > and then force them to sync data?*
>> >
>> > MASTER: 75: ::::::::::::::::::::::::::::::::::::
>> > 3 09:56:03,823 - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.30
>> > 2016-05-13 09:56:03,860 - ERROR [main:FileTxnSnapLog@210] - Parent /brokers/ids missing for /brokers/ids/2
>> > 2016-05-13 09:56:03,862 - ERROR [main:QuorumPeer@453] - Unable to load database on disk
>> > java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /brokers/ids
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
>> >        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
>> > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
>> >        ... 6 more
>> > 2016-05-13 09:56:03,865 - ERROR [main:QuorumPeerMain@89] - Unexpected exception, exiting abnormally
>> > java.lang.RuntimeException: Unable to run quorum server
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
>> > Caused by: java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /brokers/ids
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
>> >        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
>> >        ... 4 more
>> > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
>> >        ... 6 more
>> >
>> >
>> > 2016-05-13 09:57:29,084 - ERROR [main:QuorumPeerMain@89] - Unexpected exception, exiting abnormally
>> > java.lang.RuntimeException: Unable to run quorum server
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
>> > Caused by: java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /brokers/ids
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
>> >        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
>> >        at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
>> >        ... 4 more
>> > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/ids
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
>> >        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
>> >        ... 6 more
>> >
>> >
>> > SECOND: 76 :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
>> > ING (n.state), 3 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 09:42:40,650 - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@762] - Connection broken for id 1, my id = 2, error =
>> > java.io.EOFException
>> >        at java.io.DataInputStream.readInt(DataInputStream.java:392)
>> >        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747)
>> > 2016-05-13 09:42:40,650 - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker
>> > 2016-05-13 09:42:40,651 - WARN [SendWorker:1:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue
>> > java.lang.InterruptedException
>> >        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
>> >        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
>> >        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
>> >        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831)
>> >        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62)
>> >        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667)
>> > 2016-05-13 09:42:40,651 - WARN [SendWorker:1:QuorumCnxManager$SendWorker@688] - Send worker leaving threa
>> >
>> >
>> > ..... then these ...............
>> >
>> > ==> /var/log/zookeeper/zookeeper.log <==
>> > 2016-05-13 10:01:20,334 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /X.Y.Z.75:58954
>> > 2016-05-13 10:01:20,334 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>> > 2016-05-13 10:01:20,335 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /X.Y.Z.75:58954 (no session established for client)
>> >
>> > ==> /home/kafka/kafka/kafka.log <==
>> > [2016-05-13 10:01:20,412] INFO Opening socket connection to server X.Y.Z.75/X.Y.Z.75:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
>> > [2016-05-13 10:01:20,413] INFO Socket connection established to X.Y.Z.75/X.Y.Z.75:2181, initiating session (org.apache.zookeeper.ClientCnxn)
>> > [2016-05-13 10:01:20,637] WARN Session 0x254a9245fc00000 for server X.Y.Z.75/X.Y.Z.75:2181, unexpected error, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn)
>> > java.io.IOException: Connection reset by peer
>> >        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>> >        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>> >        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>> >        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>> >        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:384)
>> >        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
>> >        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
>> >        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
>> > [2016-05-13 10:01:21,782] INFO Opening socket connection to server X.Y.Z.76/X.Y.Z.76:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
>> >
>> >
>> >
>> >
>> >
>> > THIRD - added last::::::::::::::::::::::::::::::::::::::::
>> >
>> > LOWING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 03:03:39,540 - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@774] - Notification time out: 25600
>> > 2016-05-13 03:03:39,569 - WARN [WorkerSender[myid=3]:QuorumCnxManager@368] - Cannot open channel to 1 at election address /X.Y.Z.75:3888
>> > java.net.ConnectException: Connection refused
>> >        at java.net.PlainSocketImpl.socketConnect(Native Method)
>> >        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>> >        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>> >        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>> >        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>> >        at java.net.Socket.connect(Socket.java:579)
>> >        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
>> >        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
>> >        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
>> >        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
>> >        at java.lang.Thread.run(Thread.java:745)
>> > 2016-05-13 03:03:39,570 - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3 (n.leader), 0x100000052 (n.zxid), 0x108d2 (n.round), LOOKING (n.state), 3 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 03:03:39,596 - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 3 (n.leader), 0x100000052 (n.zxid), 0x108d1 (n.round), FOLLOWING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 03:03:47,801 - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x100000052 (n.zxid), 0x108d2 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 03:03:48,013 - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x100000052 (n.zxid), 0x108d2 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 03:03:48,415 - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x100000052 (n.zxid), 0x108d2 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 03:03:49,216 - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x100000052 (n.zxid), 0x108d2 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)
>> > 2016-05-13 03:03:50,818 - INFO [WorkerReceiver[myid=3]:FastLeaderElection@542] - Notification: 2 (n.leader), 0x100000052 (n.zxid), 0x108d2 (n.round), LOOKING (n.state), 2 (n.sid), 0x1 (n.peerEPoch), LOOKING (my state)



-- 
Radha Krishna, Proddaturi
253-234-5657
