zookeeper-user mailing list archives

From Rakesh Radhakrishnan <rake...@apache.org>
Subject Re: Sessions Expire due to Network partitioning in Zookeeper
Date Thu, 02 Mar 2017 10:38:57 GMT
>>>> According to my understanding, when a client tries to connect to a
>>>> server that it cannot reach because of a network partition, it uses a
>>>> blocking call and waits too long trying to connect to that server.

Actually, the ZooKeeper client has a retry mechanism.
The client sends a ping every 1/3 of the session timeout (15000ms with the
45000ms session timeout used here) and then looks for a response before
another 1/3 elapses. This allows time to reconnect to a different server
(and still maintain the session) if the connected server becomes unavailable.
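
As a rough illustration of the timing above, here is a minimal sketch in
Python (not ZooKeeper code) that works out the ping schedule for the 45000ms
session timeout used in this test. The function name `ping_schedule` is
hypothetical, and the 1/3 fractions are taken from the description above
rather than from the client source.

```python
def ping_schedule(session_timeout_ms):
    """Split a session timeout into the three phases described above."""
    third = session_timeout_ms // 3
    return {
        # A ping is sent every 1/3 of the session timeout.
        "ping_interval_ms": third,
        # A response is expected before another 1/3 elapses.
        "response_deadline_ms": 2 * third,
        # Whatever is left is the time available to reach another server.
        "reconnect_budget_ms": session_timeout_ms - 2 * third,
    }


schedule = ping_schedule(45000)
for name, value in schedule.items():
    print(name, value)
```

With a 45000ms timeout this leaves roughly 15000ms to reconnect elsewhere, so
a single slow connection attempt can use up most of that budget.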

Could you grep for the following log message in your client log and tell me
how much time C3 took for the reconnection attempts?
"Client session timed out, have not heard from server in "

C3 might have first attempted to reconnect to B and then to A. We also need
to check how much time C3 took to detect the connection failure from server C.
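
One way to do this check is sketched below in Python. It pulls the waiting
time out of every client log line that contains the message quoted above and
compares the total with the session timeout. This is only a sketch under
stated assumptions: the sample log lines are made up for illustration, the
helper names are hypothetical, and the assumption that the message is
followed by a number and `ms` should be verified against your actual log.
Adding up the waits is also a simplification of how the server tracks the
session.

```python
import re

MARKER = "Client session timed out, have not heard from server in "
# Assumption: the marker is immediately followed by "<number>ms".
PATTERN = re.compile(re.escape(MARKER) + r"(\d+)ms")


def timed_out_durations(lines):
    """Return the wait in ms from every log line containing MARKER."""
    return [int(m.group(1)) for m in map(PATTERN.search, lines) if m]


def fits_in_session(durations_ms, session_timeout_ms):
    """Rough check: do the reported waits add up to less than the timeout?"""
    return sum(durations_ms) < session_timeout_ms


# Hypothetical sample lines, not taken from a real client log.
sample_log = [
    "10:00:30 WARN ClientCnxn - " + MARKER + "30003ms for sessionid 0x1",
    "10:00:31 INFO ClientCnxn - Opening socket connection to server B",
    "10:00:46 WARN ClientCnxn - " + MARKER + "15001ms for sessionid 0x1",
]

waits = timed_out_durations(sample_log)
print(waits)
print(fits_in_session(waits, 45000))
```

To run it on a real log, pass the open file object instead of `sample_log`,
since the function only needs an iterable of lines.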

Could you please share the zk client log so that we can dig further?

Rakesh


On Thu, Mar 2, 2017 at 11:04 AM, Tharindu Kumara <zonik.hatkumara@gmail.com>
wrote:

>  > 1) Could you tell me the status of Server C? Has it lost its connection to
>  >     the quorum, and does it keep failing to rejoin while B is the Leader?
>
> Yes, B is the Leader. Server C is completely isolated from the Leader (B)
> and cannot communicate with it. C repeatedly fails to connect to the Leader.
>
>
>  > 2) C3 is connected to C. Please tell me the connection host string passed
>  >     to this client. Does it contain all three servers' info,
>  >     "A:clientport, B:clientport, C:clientport"?
>
> Yes, C3's connection string contains all three servers ("A:clientport,
> B:clientport, C:clientport").
>
>
>  > 3) Please check all three servers and client C3 logs to see any
>  >    inconsistencies or exceptions.
>
> After looking at the logs, it seems that when server C is isolated from the
> Leader, a disconnect event fires on client C3. C3 then spends too long
> trying to connect to Server B (the Leader), but it cannot connect, because
> we blocked the connection between Server C and Server B. Basically, C3
> spends more than half of the session timeout trying to connect to Server B.
>
> After figuring out that it cannot connect to Server B, C3 tries Server A and
> connects successfully. But this is too late, because the session has
> already expired by the time C3 connects.
>
> This happens only sometimes. When all the servers are specified in the
> client's connect string, C3 sometimes tries Server A first after
> disconnecting from Server C, instead of Server B. In that case C3
> reconnects to the quorum before the session expires.
>
> According to my understanding, when a client tries to connect to a server
> that it cannot reach because of a network partition, it uses a blocking call
> and waits too long trying to connect to that server.
>
>
>
>  > 4) Which ZooKeeper version did you use in your testing?
>
> I used ZooKeeper 3.4.9 (the current stable release).
>
>
>
> On Thu, Mar 2, 2017 at 7:48 AM, Rakesh Radhakrishnan <rakeshr@apache.org>
> wrote:
>
> > Hi,
> >
> > Could you please give a few more details:
> >
> > 1) Could you tell me the status of Server C? Has it lost its connection to
> > the quorum, and does it keep failing to rejoin while B is the Leader?
> >
> > 2) C3 is connected to C. Please tell me the connection host string passed
> > to this client. Does it contain all three servers' info, "A:clientport,
> > B:clientport, C:clientport"?
> >
> > 3) Please check all three servers and client C3 logs to see any
> > inconsistencies or exceptions.
> >
> > 4) Which ZooKeeper version did you use in your testing?
> >
> >
> > Rakesh
> >
> > On Wed, Mar 1, 2017 at 4:55 PM, Tharindu Kumara <zonik.hatkumara@gmail.com>
> > wrote:
> >
> > > Recently, I carried out a test to find the behavior of clients when a
> > > client is partitioned from the ensemble.
> > >
> > > Here I used an ensemble of 3 ZooKeeper servers called A, B and C. The
> > > quorum was set up as below.
> > >
> > > A - Follower
> > > B - Leader
> > > C - Follower
> > >
> > > A  <---> B <---> C
> > >    \____________/
> > >
> > > 3 clients are connected to the ensemble as below.
> > >
> > > C1 is connected to A
> > > C2 is connected to B
> > > C3 is connected to C
> > >
> > > I used iptables to remove the network link between B and C.
> > >
> > > command used: iptables -I INPUT -s 123.123.45.123 -j DROP
> > >
> > > After removing the link, the connections look like below.
> > >
> > > A  <----> B         C
> > >    \____________/
> > >
> > > Simply put, there is no way to communicate from B to C and vice versa.
> > >
> > > What I noticed is that the client connected to ZooKeeper Server "C"
> > > could not reconnect to the ensemble, resulting in a session expiration.
> > >
> > > For this experiment I used a tickTime of 3000ms and a client session
> > > expiration timeout of 45000ms. I also tested with different combinations.
> > >
> > > Can someone please explain the root cause of this behavior?
> > >
> >
>
