zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: devops/admin/client question: What do you do when you rollback?
Date Thu, 04 Aug 2011 22:10:31 GMT
Our error reporting server->client has always been weak. It's a PITA
to debug in production because a lot of times when the client gets
bounced it's not clear from the client side why (you end up having to
search the server log - for example when maxClientCnxns is exceeded).
It would be great to fix this, esp if the server could provide insight
to the client about why (an error code/message perhaps). Doing it in a
b/w compatible way might be tough though...


On Thu, Aug 4, 2011 at 2:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> This is used normally to guarantee in-order data views.  If you get
> disconnected from one host in an advanced state and then connect to an out
> of date slave, ZK automatically disconnects you to avoid letting you see
> time go backwards.  Your situation is different of course.
> On Thu, Aug 4, 2011 at 7:05 PM, Fournier, Camille F. <
> Camille.Fournier@gs.com> wrote:
>> Right now the server just detects that the zxid is wrong, and calls close
>> on the client. The client logs:
>> 15:01:47,593 - INFO
>>  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1159] - Unable to
>> read additional data from server sessionid 0x131962b00540000, likely server
>> has closed socket, closing socket connection and attempting reconnect
>> (branch 3.3.3)
>> I will poke around and see if I can figure out a nicer way to indicate this
>> condition. The expired state is perfectly fine for me in my use case.
>> C
>> -----Original Message-----
>> From: Patrick Hunt [mailto:phunt@apache.org]
>> Sent: Thursday, August 04, 2011 1:51 PM
>> To: user@zookeeper.apache.org
>> Subject: Re: devops/admin/client question: What do you do when you
>> rollback?
>> On Thu, Aug 4, 2011 at 10:29 AM, Fournier, Camille F.
>> <Camille.Fournier@gs.com> wrote:
>> > We had an issue here the other day where the ZK servers were running
>> > poorly, and in an effort to get them healthy again we ended up rolling back
>> > the cluster state. While this was, in retrospect, not the right solution to
>> > the problem we were facing, it brought up another problem. Namely, that many
>> > of our clients couldn't reconnect with their sessions because their zxid was
>> > too high (expected), but that the error they got when trying to do that
>> > reconnection was just a vanilla disconnected error. The result was that most
>> > of our clients had to be bounced.
>> Hi Camille, there's a long-standing JIRA on this:
>> https://issues.apache.org/jira/browse/ZOOKEEPER-523
>> > Aside from trying hard to avoid ever rolling back the cluster state, does
>> > anyone have a way they deal with this situation if it occurs? Should we
>> > consider enhancing the error message to the client so we could track the
>> > fact that we were ahead of the quorum zxid and react sensibly? Alternately,
>> > since we were sending a sessionId along with the zxid, perhaps it would be
>> > nice to check to see if the sessionId exists before checking the zxid, which
>> > would send an expired state signal which my client code could handle
>> > cleanly.
>> It seems reasonable that if the client connects to all servers in the
>> ensemble (that it knows about) and sees that it's ahead of each one,
>> it should consider the session expired (we could add a new state, but
>> seems like just treating as expired with a good log message would be
>> better from b/w compat standpoint).
>> I can't recall, does the client have sufficient information to make
>> this determination, or is the server just disconnecting?
>> Patrick
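
The rule proposed in the quoted message (treat the session as expired once the client has tried every server it knows about and is ahead of each one) could be sketched roughly as below. This is only an illustration in plain Java, not ZooKeeper code: the class `RollbackDetector`, the enum `Verdict`, and both methods are hypothetical names, and the sketch assumes the client can somehow learn each server's last zxid, which, per the thread, the 3.3.3 client cannot do because the server simply closes the socket.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names exist in the ZooKeeper client.
public class RollbackDetector {

    enum Verdict { RECONNECT, TREAT_AS_EXPIRED }

    // Server-side rule described in the thread: a server refuses a client
    // whose last seen zxid is ahead of the server's own last zxid.
    static boolean serverAccepts(long clientLastZxid, long serverLastZxid) {
        return clientLastZxid <= serverLastZxid;
    }

    // Client-side rule proposed in the thread: give up on the session only
    // when every known server is behind the client.
    // ASSUMPTION: serverZxids holds the last zxid reported by each server
    // the client has tried; the real protocol does not expose this.
    static Verdict decide(long clientLastZxid, Map<String, Long> serverZxids) {
        if (serverZxids.isEmpty()) {
            return Verdict.RECONNECT; // no evidence yet, keep trying
        }
        for (long serverZxid : serverZxids.values()) {
            if (serverAccepts(clientLastZxid, serverZxid)) {
                return Verdict.RECONNECT; // at least one server can serve us
            }
        }
        return Verdict.TREAT_AS_EXPIRED; // ahead of every known server
    }

    public static void main(String[] args) {
        // Made-up host names and zxid values, for illustration only.
        Map<String, Long> afterRollback = new LinkedHashMap<>();
        afterRollback.put("zk1:2181", 90L);
        afterRollback.put("zk2:2181", 95L);
        afterRollback.put("zk3:2181", 95L);
        System.out.println(decide(120L, afterRollback));

        // One up-to-date server is enough to keep the session alive.
        afterRollback.put("zk3:2181", 130L);
        System.out.println(decide(120L, afterRollback));
    }
}
```

Requiring every known server to be behind keeps the normal case intact: a client that merely lands on one lagging server still reconnects elsewhere, and only a client stranded by a rollback is told to start a new session.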
