zookeeper-user mailing list archives

From "Fournier, Camille F." <Camille.Fourn...@gs.com>
Subject RE: devops/admin/client question: What do you do when you rollback?
Date Fri, 05 Aug 2011 16:01:09 GMT
Actually... can I update the ConnectRequest protocol version number? If I can do that, I
can have the server send back the indicating ConnectResponse only to clients with a higher
protocol version. It doesn't look like the version is read anywhere right now.
(Moving this to dev since we've moved to a dev discussion)

C
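
A minimal sketch of the version-gating idea in the message above, written in Python purely for illustration. The version numbers, the `client_ahead_of_server` indicator, and the record layouts are all hypothetical; they are not the real ZooKeeper wire format or constants. The only point shown is that older clients keep the current behavior (the socket is closed) while newer clients get an explicit signal.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration only; not ZooKeeper constants.
LEGACY_PROTOCOL_VERSION = 0
ROLLBACK_AWARE_PROTOCOL_VERSION = 1


@dataclass
class ConnectRequest:
    protocol_version: int
    last_zxid_seen: int
    session_id: int


@dataclass
class ConnectResponse:
    session_id: int
    client_ahead_of_server: bool  # hypothetical indicator, not a real field


def handle_connect(req: ConnectRequest, server_zxid: int) -> Optional[ConnectResponse]:
    """Return a response, or None to model the server closing the socket."""
    if req.last_zxid_seen <= server_zxid:
        # Normal case: the client is not ahead of this server.
        return ConnectResponse(req.session_id, client_ahead_of_server=False)
    if req.protocol_version >= ROLLBACK_AWARE_PROTOCOL_VERSION:
        # Newer clients understand the indicator, so tell them why.
        return ConnectResponse(req.session_id, client_ahead_of_server=True)
    # Older clients keep the existing behavior: the connection is dropped.
    return None
```

Gating on the version sent by the client means the server never sends the new indicator to a client that would misread it.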

-----Original Message-----
From: Fournier, Camille F. [Tech] 
Sent: Friday, August 05, 2011 11:57 AM
To: 'user@zookeeper.apache.org'
Subject: RE: devops/admin/client question: What do you do when you rollback?

Hmmm. I thought I had another way around this but I don't. We really didn't write the client
to be easy to encode other errors in the connection result... I think any good solution will
have to be in our 4.0 clojure rewrite ;)

C


-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Friday, August 05, 2011 11:51 AM
To: user@zookeeper.apache.org
Subject: Re: devops/admin/client question: What do you do when you rollback?

If you get the lower zxid from the leader then you know that things have
gone south.

Likewise, if you get a lower epoch number from a node that thinks that it is
in quorum then things are not good.  The definition of "thinks it is in
quorum" is problematic of course.
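
The two checks described above can be sketched as follows, assuming the usual zxid layout of a 32-bit epoch in the high bits and a 32-bit counter in the low bits. The function names are made up for this example, and the sketch does not address the hard part noted above: deciding whether the reporting node really is in quorum.

```python
from typing import Optional, Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter)."""
    epoch = (zxid >> 32) & 0xFFFFFFFF
    counter = zxid & 0xFFFFFFFF
    return epoch, counter


def rollback_reason(client_zxid: int, server_zxid: int) -> Optional[str]:
    """Return why the server looks behind the client, or None if it does not.

    Hypothetical helper: it only compares the two numbers and cannot tell a
    stale follower apart from a rolled-back ensemble.
    """
    client_epoch, client_counter = split_zxid(client_zxid)
    server_epoch, server_counter = split_zxid(server_zxid)
    if server_epoch < client_epoch:
        return "epoch"
    if server_epoch == client_epoch and server_counter < client_counter:
        return "counter"
    return None
```

For example, a client that last saw zxid `0x200000005` and is told `0x100000009` by the server gets `"epoch"`, since the server's epoch is lower even though its counter is higher.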

On Fri, Aug 5, 2011 at 10:57 AM, Fournier, Camille F. <
Camille.Fournier@gs.com> wrote:

> Oh blah, of course it won't be b/w compatible, because all the older
> clients would expire their sessions in the instance of a single zxid higher
> than the cluster zxid which I doubt most people want.
>
> Is there a way to check if the zxid of the client is higher than the
> current possible zxid after connection, and send the session_expired then?
> That would at least help us out most of the way.
>
> -----Original Message-----
> From: Patrick Hunt [mailto:phunt@apache.org]
> Sent: Thursday, August 04, 2011 7:23 PM
> To: user@zookeeper.apache.org
> Subject: Re: devops/admin/client question: What do you do when you
> rollback?
>
> Sounds reasonable to me as long as it's b/w compatible (which it seems
> like it would be), anything we can do to improve this situation would
> be huge - I frequently see our support team trying to address this
> (e.g. the max count exceeded issue) with clients like hbase. Def plus
> for supportability.
>
> Patrick
>
> On Thu, Aug 4, 2011 at 4:11 PM, Camille Fournier <camille@apache.org>
> wrote:
> > I'm thinking of hacking it through the connectresponse session timeout
> > (similar to the way we detect session rejected). I wrote up a prototype
> > that worked ok this way. Might could extend this hack to other things,
> > using that field as an encoded error msg, thoughts?
> >
> > C
> > On Aug 4, 2011 6:10 PM, "Patrick Hunt" <phunt@apache.org> wrote:
> >> Our error reporting server->client has always been weak. It's a PITA
> >> to debug in production because a lot of times when the client gets
> >> bounced it's not clear from the client side why (you end up having to
> >> search the server log - for example when maxClientCount is exceeded).
> >> It would be great to fix this, esp if the server could provide insight
> >> to the client about why (an error code/message perhaps). Doing it in a
> >> b/w compatible way might be tough though...
> >>
> >> Patrick
> >>
> >> On Thu, Aug 4, 2011 at 2:45 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>> This is used normally to guarantee in-order data views.  If you get
> >>> disconnected from one host in an advanced state and then connect to an
> >>> out of date slave, ZK automatically disconnects you to avoid letting you
> >>> see time go backwards.  Your situation is different of course.
> >>>
> >>>
> >>>
> >>> On Thu, Aug 4, 2011 at 7:05 PM, Fournier, Camille F. <
> >>> Camille.Fournier@gs.com> wrote:
> >>>
> >>>> Right now the server just detects that the zxid is wrong, and calls
> >>>> close on the client. The client logs:
> >>>> 15:01:47,593 - INFO
> >>>>  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1159] -
> >>>> Unable to read additional data from server sessionid 0x131962b00540000,
> >>>> likely server has closed socket, closing socket connection and
> >>>> attempting reconnect
> >>>> (branch 3.3.3)
> >>>>
> >>>> I will poke around and see if I can figure out a nicer way to indicate
> >>>> this condition. The expired state is perfectly fine for me in my use case.
> >>>>
> >>>> C
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: Patrick Hunt [mailto:phunt@apache.org]
> >>>> Sent: Thursday, August 04, 2011 1:51 PM
> >>>> To: user@zookeeper.apache.org
> >>>> Subject: Re: devops/admin/client question: What do you do when you
> >>>> rollback?
> >>>>
> >>>> On Thu, Aug 4, 2011 at 10:29 AM, Fournier, Camille F.
> >>>> <Camille.Fournier@gs.com> wrote:
> >>>> > We had an issue here the other day where the ZK servers were running
> >>>> > poorly, and in an effort to get them healthy again we ended up rolling
> >>>> > back the cluster state. While this was, in retrospect, not the right
> >>>> > solution to the problem we were facing, it brought up another problem.
> >>>> > Namely, that many of our clients couldn't reconnect with their sessions
> >>>> > because their zxid was too high (expected), but that the error they got
> >>>> > when trying to do that reconnection was just a vanilla disconnected
> >>>> > error. The result was that most of our clients had to be bounced.
> >>>>
> >>>> Hi Camille, there's a long standing jira on this:
> >>>> https://issues.apache.org/jira/browse/ZOOKEEPER-523
> >>>>
> >>>> > Aside from trying hard to avoid ever rolling back the cluster state,
> >>>> > does anyone have a way they deal with this situation if it occurs?
> >>>> > Should we consider enhancing the error message to the client so we
> >>>> > could track the fact that we were ahead of the quorum zxid and react
> >>>> > sensibly? Alternately, since we were sending a sessionId along with the
> >>>> > zxid, perhaps it would be nice to check to see if the sessionId exists
> >>>> > before checking the zxid, which would send an expired state signal
> >>>> > which my client code could handle cleanly.
> >>>>
> >>>> It seems reasonable that if the client connects to all servers in the
> >>>> ensemble (that it knows about) and sees that it's ahead of each one,
> >>>> it should consider the session expired (we could add a new state, but
> >>>> seems like just treating as expired with a good log message would be
> >>>> better from b/w compat standpoint).
> >>>>
> >>>> I can't recall, does the client have sufficient information to make
> >>>> this determination, or is the server just disconnecting?
> >>>>
> >>>> Patrick
> >>>>
> >>>
> >
>
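
The client-side rule suggested earlier in the thread (treat the session as expired once the client has tried every server it knows about and is ahead of each one) could look roughly like this. It is a sketch under a strong assumption: that the client can learn each server's last zxid, which the thread leaves as an open question. The function name and the host-to-zxid mapping are hypothetical.

```python
from typing import Dict, Optional


def should_treat_as_expired(client_zxid: int,
                            server_zxids: Dict[str, Optional[int]]) -> bool:
    """Decide whether the client is ahead of every known server.

    server_zxids maps each known host to the last zxid it reported, or to
    None if the client has not yet heard from it. Any unknown entry keeps
    the answer at False, so the client keeps retrying instead of giving up.
    """
    if not server_zxids:
        return False
    return all(zxid is not None and zxid < client_zxid
               for zxid in server_zxids.values())
```

Returning False while any server is still unknown keeps the check conservative: a single out-of-date follower is not enough to expire the session.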