zookeeper-user mailing list archives

From Stephen Tyree <tyree...@gmail.com>
Subject Re: libzookeeper_mt and GDB
Date Mon, 18 Jul 2011 17:07:46 GMT
Thanks for the replies, everyone. Re: Camille, thanks for the link to the
RPD; there is a lot of valuable insight in there that will help us solve
this problem. I agree that saying "I'm in a debugger, don't expire me"
is a bad idea for a number of reasons (what if the process exits before
saying otherwise?). I agree that developers should be aware of GDB non-stop
mode and use it when appropriate, but there are plenty of valid reasons
to use GDB all-stop mode when debugging an application. Having Zookeeper
as a subsystem of their processes, without some other way to
extend sessions, basically removes that valuable debugging option from the

Re: Ted, I've considered the second option (GDB macros) before. You can script
GDB with Python to add special commands to it, so I presume that means you
can do things like access thread backtraces and identify particular threads.
In all-stop mode you can asynchronously continue specific threads, so it
would be possible to script commands which keep ZK running. Whether you can
hook into breakpoints and continues is something I'm not sure of. This seems
like functionality people would want in general, however, so I wonder if it's
possible to modify all-stop mode to allow specifying threads which
don't stop at other threads' breakpoints. Based on the design documents for
all-stop mode it doesn't seem that way, but it would be nice.
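
On the hooking question, GDB builds with Python support expose stop and
continue events (gdb.events.stop and gdb.events.cont), so something along the
following lines might be a starting point. It is only a sketch: the command
name zk-threads, the SESSION_TIMEOUT_S value, and the thread function names
do_io and do_completion are assumptions to check against your own build, and
it only reports on the situation rather than keeping heartbeats alive. The
2/3 threshold assumes the client pings after a third of the session timeout
has passed, as described in the original message.

```python
import time

try:
    import gdb  # provided only by GDB's embedded Python interpreter
except ImportError:
    gdb = None  # lets the pure helpers below be imported outside GDB

# Assumption: entry points of the libzookeeper_mt client threads; verify with
# "thread apply all bt" against your build.
ZK_THREAD_FUNCS = ("do_io", "do_completion")
# Assumption: the session timeout your client negotiated, in seconds.
SESSION_TIMEOUT_S = 30.0


def looks_like_zk_thread(frame_names, wanted=ZK_THREAD_FUNCS):
    """True if any function name in a backtrace is a wanted entry point."""
    return any(name in wanted for name in frame_names if name)


def session_at_risk(paused_s, timeout_s=SESSION_TIMEOUT_S):
    """True once a pause reaches 2/3 of the timeout.

    The last ping may have been sent up to timeout/3 before the stop, so
    only 2/3 of the timeout is guaranteed to remain at a breakpoint.
    """
    return paused_s >= timeout_s * 2.0 / 3.0


if gdb is not None:
    _stopped_at = [None]

    def _on_stop(event):
        # Remember when the inferior stopped (breakpoint, signal, step, ...).
        _stopped_at[0] = time.monotonic()

    def _on_cont(event):
        # Warn on resume if the pause was long enough to endanger the session.
        if _stopped_at[0] is None:
            return
        paused = time.monotonic() - _stopped_at[0]
        if session_at_risk(paused):
            gdb.write("stopped for %.1fs; the ZooKeeper session may have "
                      "expired\n" % paused)

    gdb.events.stop.connect(_on_stop)
    gdb.events.cont.connect(_on_cont)

    class ZkThreads(gdb.Command):
        """List threads whose backtrace contains a ZooKeeper thread function."""

        def __init__(self):
            super().__init__("zk-threads", gdb.COMMAND_USER)

        def invoke(self, arg, from_tty):
            original = gdb.selected_thread()
            for thread in gdb.selected_inferior().threads():
                thread.switch()
                names = []
                frame = gdb.newest_frame()
                while frame is not None:
                    names.append(frame.name())
                    frame = frame.older()
                if looks_like_zk_thread(names):
                    gdb.write("thread %d looks like a ZooKeeper client "
                              "thread\n" % thread.num)
            if original is not None:
                original.switch()  # restore the user's selected thread

    ZkThreads()
```

Saved as, say, zk_gdb.py (a hypothetical file name), it could be loaded with
"source zk_gdb.py" at the GDB prompt, after which "zk-threads" lists the
candidate threads whenever the process is stopped.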

On Mon, Jul 18, 2011 at 12:09 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I haven't used gdb in a bunch of years and looking at the manual, I don't
> see a way to continue a single thread.  That makes my second suggestion
> silly unless there is something I didn't see (which is decidedly possible).
>
> On Mon, Jul 18, 2011 at 9:05 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > I have two suggestions that might or might not work.
> >
> > First, you can increase the timeouts to high values and also write a bit of
> > code that can expire the session instantly.  The ZK unit tests have examples
> > of how to do this by opening a second connection with the same session id
> > and then closing it.  This has the effect of instantly expiring the original
> > connection.  You still have a bit of an education process here.  This is
> > high risk since the configuration file with the long timeouts will probably
> > get checked in by mistake at some point.  There might be a way to avoid this
> > with a special startup option that over-rides the session length for just
> > the one invocation.
> >
> > A second idea is that you might be able to define a gdb macro that is
> > invoked when you hit a breakpoint and another that is invoked at continue
> > time (or manually).  The first macro would invoke a function to start or
> > continue a background thread that can keep the heartbeats going.  The second
> > macro would kill that thread and restore normal operation.  The ideal case
> > would be to continue just the normal ZK heartbeat thread except that might
> > cause notifications to be called in the background which could confuse the
> > person doing the debugging.
> >
> > If you can make it work, the second approach would give you something
> > approaching a normal debugging experience.
> >
> >
> > On Mon, Jul 18, 2011 at 7:41 AM, Fournier, Camille F. <Camille.Fournier@gs.com> wrote:
> >
> >> ZooKeeper can't possibly know that you are in GDB unless you have a
> >> special message that you send to the server that says "I'm in a debugger
> >> now, please don't expire me". You might be able to hack something in to do
> >> this, but do you really want to? I think the second idea is best. If you are
> >> a developer working in any kind of multi-threaded distributed system, you
> >> need to be aware that suspending all threads can lead to the remote parts of
> >> your process failing. That's just professional distributed systems
> >> development 101. This isn't unique to C, Java developers also have to choose
> >> between suspending all threads during debugging and suspending only the
> >> thread affected by the breakpoint.
> >>
> >> You can also split the difference between points one and two, namely, get
> >> the message out to the developers that if they're working against ZK and
> >> suspend all threads, they might end up losing their session, but when
> >> working in an env that you expect to do a lot of debugging in (development,
> >> QA), jack up the timeout so it happens less frequently.
> >>
> >> If you truly want to separate the process from its zookeeper heartbeating,
> >> you could take a tip from the HBASE devs in
> >> https://issues.apache.org/jira/browse/HBASE-1316. Because dealing with
> >> timeouts is much more of an issue in large Java processes due to full GC,
> >> they have experimented with various solutions that you might be able to
> >> apply here in C.
> >>
> >> C
> >>
> >>
> >> -----Original Message-----
> >> From: Stephen Tyree [mailto:tyree731@gmail.com]
> >> Sent: Monday, July 18, 2011 10:07 AM
> >> To: user@zookeeper.apache.org
> >> Subject: libzookeeper_mt and GDB
> >>
> >> Hello All,
> >>
> >> I've been using Zookeeper at my place of work for a few months now
> >> successfully, but there has been a lingering issue I haven't been able
> >> to solve without issue. Namely, when using GDB with libzookeeper_mt,
> >> once you hit a breakpoint, the program you're running essentially has
> >> until the session timeout to continue onward or its session will be
> >> expired. This is a pain in the butt when using ephemeral znodes, but in
> >> my case those ephemeral znodes are tied to locks which means losing them
> >> is bad news. I've tried a number of different ideas to solve this issue,
> >> and all of them have varying degrees of success.
> >>
> >> The first idea I had was jacking up the session timeouts, which
> >> obviously works. This extends the time you have at any given breakpoint
> >> to figure out the issue and move onward, but comes at the expense of
> >> ephemeral znodes living for much longer than they reasonably should when
> >> the program crashes (something that is likely to be an issue if you're
> >> using GDB). In the case of locking, those znodes which hang around for a
> >> while have negative consequences on the performance of the system. This
> >> is how we currently deal with the issue.
> >>
> >> The second idea was to instruct all developers at my job to use GDB
> >> non-stop mode for debugging. This works, since GDB would only stop the
> >> thread which hit a breakpoint in this mode, but runs into the issue that
> >> I need to change the development habits of hundreds of engineers just to
> >> save myself the trouble. Ideally Zookeeper would function with GDB in
> >> whatever mode you felt like using.
> >>
> >> The third idea was decidedly more intricate. Essentially I spawn a
> >> subprocess which uses the exact same session I do, but only holds onto
> >> that session while the parent process is unresponsive (at a breakpoint
> >> probably). This essentially locks your session while at breakpoints, but
> >> has no impact while not at breakpoints. The only caveat to this approach
> >> is the transition between breakpoints and non-breakpoints. Since the
> >> server last saw the session in the subprocess, it doesn't send heartbeat
> >> messages to the parent process. This means it's up to the parent process
> >> to send PING messages to the server in order to reestablish the session,
> >> but this only happens at 1/3 of the session timeout (which is too long).
> >>
> >> Whatever the case, a simple, generic solution would be ideal for this
> >> situation. It might be as simple as allowing configurable PING messages
> >> (for the third solution) or it might be as frustrating as creating a
> >> Zookeeper service which runs outside of the process (thus bypassing
> >> GDB's breakpoints). Any ideas?
> >>
> >> Thanks,
> >> Stephen Tyree
> >>
> >
> >
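
The timeout trade-off discussed in this thread can be put into rough numbers.
The sketch below assumes the server clamps the requested session timeout
between minSessionTimeout and maxSessionTimeout, which default to 2 and 20
times tickTime, and that the client pings after a third of the timeout, as
the original message describes. The function names are made up for
illustration, and tick_ms=2000 is only the value found in the sample zoo.cfg.

```python
def negotiated_timeout_ms(requested_ms, tick_ms=2000, min_ms=None, max_ms=None):
    """Clamp a requested session timeout the way the server is assumed to.

    min_ms / max_ms stand in for minSessionTimeout / maxSessionTimeout and
    fall back to 2 * tickTime and 20 * tickTime when not given.
    """
    low = 2 * tick_ms if min_ms is None else min_ms
    high = 20 * tick_ms if max_ms is None else max_ms
    return max(low, min(requested_ms, high))


def tradeoff(requested_ms, **server_settings):
    """Rough consequences of asking for a given session timeout."""
    timeout = negotiated_timeout_ms(requested_ms, **server_settings)
    ping = timeout // 3  # client pings after 1/3 of the timeout
    return {
        "session_timeout_ms": timeout,
        "ping_interval_ms": ping,
        # Worst case: the last ping went out a full ping interval before
        # the breakpoint was hit.
        "worst_case_breakpoint_budget_ms": timeout - ping,
        # After a crash, ephemeral znodes can linger for about this long.
        "max_ephemeral_linger_ms": timeout,
    }


if __name__ == "__main__":
    for requested in (30000, 600000):
        print(requested, tradeoff(requested))
    # Raising the server-side cap is what makes a long timeout take effect.
    print(600000, tradeoff(600000, max_ms=600000))
```

With these defaults, a request for ten minutes is cut down to 40 seconds
unless the server's maximum is raised as well, and whatever value is granted
is also roughly how long the ephemeral lock znodes outlive a crashed process.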
