zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: devops/admin/client question: What do you do when you rollback?
Date Thu, 04 Aug 2011 23:23:26 GMT
Sounds reasonable to me as long as it's b/w compatible (which it seems
like it would be), anything we can do to improve this situation would
be huge - I frequently see our support team trying to address this
(e.g. the max count exceeded issue) with clients like hbase. Def plus
for supportability.
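One way to picture the idea quoted below (reusing the session timeout field of the connect response as an encoded error message) is the following minimal sketch. It assumes, as the quoted message implies, that a client already treats a non-positive timeout as a rejected session, so an older client would keep its current behavior. The reason codes and function names are hypothetical and are not part of any ZooKeeper release.

```python
from enum import Enum


class RejectReason(Enum):
    # Hypothetical codes for illustration only; not defined by ZooKeeper.
    SESSION_REJECTED = 0
    ZXID_AHEAD_OF_SERVER = -1
    MAX_CLIENTS_EXCEEDED = -2


def encode_timeout(reason):
    """Server side: place the reason code in the timeout field."""
    return reason.value


def decode_timeout(timeout_ms):
    """Client side: return None for an accepted session, else a RejectReason."""
    if timeout_ms > 0:
        # A positive value is a real negotiated timeout.
        return None
    try:
        return RejectReason(timeout_ms)
    except ValueError:
        # Unknown non-positive value: fall back to plain rejection.
        return RejectReason.SESSION_REJECTED
```

A client that does not know a given code still sees a non-positive timeout, which is what would keep a scheme like this backward compatible under the stated assumption.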


On Thu, Aug 4, 2011 at 4:11 PM, Camille Fournier <camille@apache.org> wrote:
> I'm thinking of hacking it through the connectresponse session timeout
> (similar to the way we detect session rejected). I wrote up a prototype that
> worked ok this way. Might could extend this hack to other things, using that
> field as an encoded error msg, thoughts?
> C
> On Aug 4, 2011 6:10 PM, "Patrick Hunt" <phunt@apache.org> wrote:
>> Our error reporting server->client has always been weak. It's a PITA
>> to debug in production because a lot of times when the client gets
>> bounced it's not clear from the client side why (you end up having to
>> search the server log - for example when maxClientCount is exceeded).
>> It would be great to fix this, esp if the server could provide insight
>> to the client about why (an error code/message perhaps). Doing it in a
>> b/w compatible way might be tough though...
>> Patrick
>> On Thu, Aug 4, 2011 at 2:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>> This is used normally to guarantee in-order data views.  If you get
>>> disconnected from one host in an advanced state and then connect to an
>>> out of date slave, ZK automatically disconnects you to avoid letting you
>>> see time go backwards.  Your situation is different of course.
>>> On Thu, Aug 4, 2011 at 7:05 PM, Fournier, Camille F. <
>>> Camille.Fournier@gs.com> wrote:
>>>> Right now the server just detects that the zxid is wrong, and calls close
>>>> on the client. The client logs:
>>>> 15:01:47,593 - INFO
>>>>  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1159] - Unable to
>>>> read additional data from server sessionid 0x131962b00540000, likely server
>>>> has closed socket, closing socket connection and attempting reconnect
>>>> (branch 3.3.3)
>>>> I will poke around and see if I can figure out a nicer way to indicate this
>>>> condition. The expired state is perfectly fine for me in my use case.
>>>> C
>>>> -----Original Message-----
>>>> From: Patrick Hunt [mailto:phunt@apache.org]
>>>> Sent: Thursday, August 04, 2011 1:51 PM
>>>> To: user@zookeeper.apache.org
>>>> Subject: Re: devops/admin/client question: What do you do when you
>>>> rollback?
>>>> On Thu, Aug 4, 2011 at 10:29 AM, Fournier, Camille F.
>>>> <Camille.Fournier@gs.com> wrote:
>>>> > We had an issue here the other day where the ZK servers were running
>>>> > poorly, and in an effort to get them healthy again we ended up rolling back
>>>> > the cluster state. While this was, in retrospect, not the right solution to
>>>> > the problem we were facing, it brought up another problem. Namely, that many
>>>> > of our clients couldn't reconnect with their sessions because their zxid was
>>>> > too high (expected), but that the error they got when trying to do that
>>>> > reconnection was just a vanilla disconnected error. The result was that most
>>>> > of our clients had to be bounced.
>>>> Hi Camille, there's a long standing jira on this:
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-523
>>>> > Aside from trying hard to avoid ever rolling back the cluster state, does
>>>> > anyone have a way they deal with this situation if it occurs? Should we
>>>> > consider enhancing the error message to the client so we could track the
>>>> > fact that we were ahead of the quorum zxid and react sensibly? Alternately,
>>>> > since we were sending a sessionId along with the zxid, perhaps it would be
>>>> > nice to check to see if the sessionId exists before checking the zxid, which
>>>> > would send an expired state signal which my client code could handle
>>>> > cleanly.
>>>> It seems reasonable that if the client connects to all servers in the
>>>> ensemble (that it knows about) and sees that it's ahead of each one,
>>>> it should consider the session expired (we could add a new state, but
>>>> seems like just treating as expired with a good log message would be
>>>> better from b/w compat standpoint).
>>>> I can't recall, does the client have sufficient information to make
>>>> this determination, or is the server just disconnecting?
>>>> Patrick
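The rule suggested in the original message (treat the session as expired once the client has tried every server it knows about and is ahead of each one) could be sketched roughly as follows. This assumes the client can somehow learn each server's last zxid, which the thread itself leaves as an open question; the function name and the host-to-zxid mapping are hypothetical.

```python
def should_treat_as_expired(client_zxid, server_zxids):
    """Decide whether the client should give up on its session.

    server_zxids maps each known server to the last zxid it reported,
    or to None if that server could not be reached (hypothetical input).
    """
    if not server_zxids:
        # No known servers: nothing to conclude.
        return False
    # Expire only if every known server answered and is behind the client.
    return all(
        zxid is not None and client_zxid > zxid
        for zxid in server_zxids.values()
    )
```

An unreachable server blocks the decision in this sketch, since that server might still hold state that is as recent as the client's.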
