zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: ZK Client won't time out when quorum irrevocably goes away
Date Fri, 04 Feb 2011 00:38:52 GMT
On Thu, Feb 3, 2011 at 4:06 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> Yes thanks, I'm a little loose with the language not realizing there
> are specific states, etc.
>

No worries, I just wanted to make sure I understood the situation correctly.

> So in our scenario here, the quorum has moved, the clients will never
> (a) reconnect ever again and (b) not be able to find the new quorum
> location because IP addresses are cached.  If either:

That's correct. You've hit this issue:
https://issues.apache.org/jira/browse/ZOOKEEPER-338

> - The client refreshed from DNS (although the JVM seems to have a DNS
> cache which has hosed us as well)
> - The client expires the session
>
> We might have been in a better situation.

Perhaps. Did you re-create the ZK database, or did you re-use the
existing datastore (i.e., move it from the old servers to the new ones)?

>
> Reading the FAQ, it seems like the onus might be on the client to
> check for session disconnect and compare it against the negotiated
> session timeout to determine "oh hey we havent talked to ZK in a
> while, lets quit".  Is that an expected client task?

No, the expectation is that the client will eventually reconnect and
see the session as expired. That's the typical case with network
partitioning or the cluster being brought down, etc...

This is a special case due to ZOOKEEPER-338. For this scenario the
only "official" recourse at this time is to restart the client. (but
depending on your answer to my previous question you might need to do
that anyway)

I suppose the client could close its session if it sees that
the disconnect happened long enough ago (the session timeout + some
safety factor). But this really is a special case (and ZOOKEEPER-338
should be implemented to address it).
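
A rough sketch of that workaround, for illustration only. The class
DisconnectTracker and its method names are hypothetical and not part of
the ZooKeeper API; the clock is injected so the logic can be exercised
without a live cluster.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of ZooKeeper): remembers when the client
 * first saw a disconnect and reports when it has lasted longer than the
 * session timeout plus a safety margin.
 */
final class DisconnectTracker {
    private final long giveUpAfterMs;
    private final LongSupplier clockMs;
    private boolean disconnected = false;
    private long disconnectedSinceMs = 0L;

    DisconnectTracker(long sessionTimeoutMs, long safetyMarginMs, LongSupplier clockMs) {
        this.giveUpAfterMs = sessionTimeoutMs + safetyMarginMs;
        this.clockMs = clockMs;
    }

    /** Call when the watcher is notified of a disconnect. */
    synchronized void onDisconnected() {
        // Keep the time of the first disconnect; repeated notifications
        // must not restart the countdown.
        if (!disconnected) {
            disconnected = true;
            disconnectedSinceMs = clockMs.getAsLong();
        }
    }

    /** Call when the watcher is notified that the connection is back. */
    synchronized void onConnected() {
        disconnected = false;
    }

    /** True once the disconnect has outlasted session timeout + margin. */
    synchronized boolean shouldGiveUp() {
        return disconnected
                && clockMs.getAsLong() - disconnectedSinceMs > giveUpAfterMs;
    }
}
```

In a real client this would presumably be driven from Watcher.process():
onDisconnected() on KeeperState.Disconnected, onConnected() on
KeeperState.SyncConnected, with a timer polling shouldGiveUp() and then
calling ZooKeeper.close() (or restarting the process). The negotiated
timeout is available from ZooKeeper.getSessionTimeout() once connected;
the size of the safety margin is a judgment call.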

Patrick

>
> Thanks for the quick reply!
> -ryan
>
> On Thu, Feb 3, 2011 at 4:01 PM, Patrick Hunt <phunt@apache.org> wrote:
>> On Thu, Feb 3, 2011 at 2:57 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>> The result was the client never realized that its session was
>>> actually timed out, and the HBase processes continued to run. Kill -9
>>> and a restart fixed it.
>>
>> Hi Ryan,
>>
>> there are two issues at play here, session timeout and session
>> expiration. Correct me if I'm wrong but I think you meant to say "the
>> client never realized that its session was actually _expired_". Which
>> is correct behavior. Clients can only determine if a session is
>> expired once they reconnect to the cluster. Session timeout on the
>> other hand happens when the server heartbeat is not received by the
>> client w/in the session timeout period. Clients who are disconnected
>> from the cluster will attempt to reconnect back to the cluster until
>> they are successful. When a client is disconnected the client's
>> watchers will be notified about the disconnect. (same for expiration).
>>
>> See questions 1 & 2 here in the faq, specifically "Example state
>> transitions" in question 2:
>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/FAQ
>> Your clients were stuck btw steps 4 and 5 (which they will never reach
>> in your scenario).
>>
>> Does that help?
>>
>> Patrick
>>
>
