hbase-dev mailing list archives

From "Jim Kellerman (POWERSET)" <Jim.Keller...@microsoft.com>
Subject RE: ZK rethink?
Date Tue, 07 Apr 2009 23:25:32 GMT
Ryan,

Good idea! Because JVM thread scheduling is not fair and does not
pay attention to thread priorities, Java is highly susceptible to
thread starvation, which can lead to the HRS not being able to report
to the master (ZooKeeper) before the lease timeout.

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)


> -----Original Message-----
> From: Ryan Rawson [mailto:ryanobjc@gmail.com]
> Sent: Tuesday, April 07, 2009 1:54 PM
> To: hbase-dev@hadoop.apache.org
> Cc: joey42+reply@gmail.com
> Subject: Re: ZK rethink?
> 
> Thanks for the input Joey, and may I be the first to say "holy
> shit".
> 
> The reason their approach works is because the C API spins off OS
> threads
> that exist outside the domain of the Java VM, which means those
> threads
> never get paused for GC processing.
> 
> With that kind of input, we might want to consider doing what he
> did.  Maybe
> you can donate a bit of code?
> 
> Thanks!
> -ryan
> 
> On Tue, Apr 7, 2009 at 1:49 PM, Nitay <nitayj@gmail.com> wrote:
> 
> > Very interesting Joey. Thanks for replying with this information.
> Also,
> > welcome! :).
> >
> > I don't quite understand why the C API with JNI fixes the problem.
> Did that
> > substantially reduce your tiny, short lived objects to the point
> where the
> > GC wasn't starving the ZooKeeper IO threads anymore?
> >
> > Perhaps my initial 10 second value was not enough. Andrew, can you
> try 30
> > or
> > 60 seconds as a test on your cluster to see if that calms things
> down?
> >
> > -n
> >
> > On Tue, Apr 7, 2009 at 1:43 PM, Joey Echeverria <joey42@gmail.com>
> wrote:
> >
> > > Long time lurker, first time poster.
> > >
> > > We've used zookeeper in a write-heavy project we've been working
> on
> > > and experienced issues similar to what you described. After
> several
> > > days of debugging, we discovered that our issue was garbage
> > > collection. There was no way to guarantee we wouldn't have long
> pauses
> > > especially since our environment was the worst case for garbage
> > > collection: millions of tiny, short-lived objects. I suspect
> HBase
> > > sees similar workloads frequently, if not constantly. With
> > > anything shorter than a 30 second session time out, we got
> session
> > > expiration events extremely frequently. We needed to use 60
> seconds
> > > for any real confidence that an ephemeral node disappearing
> meant
> > > something was unavailable.
> > >
> > > We really wanted quick recovery so we ended up writing a lightweight
> > > wrapper around the C API and used swig to auto-generate a JNI
> > > interface. It's not perfect, but since we switched to this
> method
> > > we've never seen a session expiration event and ephemeral nodes
> only
> > > disappear when there are network issues or a machine/process
> goes
> > > down.
> > >
> > > I don't know if it's worth doing the same kind of thing for
> HBase as
> > > it adds some "unnecessary" native code, but it's a solution that
> I
> > > found works.
> > >
> > > On Tue, Apr 7, 2009 at 9:28 PM, Jim Kellerman (POWERSET)
> > > <Jim.Kellerman@microsoft.com> wrote:
> > > > There are a number of reasons why Zookeeper could receive a
> > > SessionExpired
> > > > event:
> > > > - The process died
> > > > - The machine died
> > > > - There is/was a network partition
> > > > - The network is flapping
> > > >
> > > > This is why the lease timeout is set to 2 minutes by default.
> If things
> > > > haven't recovered in two minutes, we assume that the region
> server is
> > > > dead, hung or in any event, unresponsive. Maybe we should add
> an API
> > > > to the region server such that the Master (or Zookeeper) could
> call it
> > > > and ask if it is still alive, before starting region server
> recovery
> > > > (ProcessServerShutdown).
> > > >
> > > > ---
> > > > Jim Kellerman, Powerset (Live Search, Microsoft Corporation)
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Nitay [mailto:nitayj@gmail.com]
> > > >> Sent: Tuesday, April 07, 2009 1:13 PM
> > > >> To: hbase-dev@hadoop.apache.org; apurtell@apache.org
> > > >> Subject: Re: ZK rethink?
> > > >>
> > > >> Hi Andrew,
> > > >>
> > > >> I agree with you that getting a SessionExpired is a problem
> for us,
> > > >> and we
> > > >> didn't really consider it when we initially put in the
> ZooKeeper
> > > >> code.
> > > >> However, I don't necessarily think a complete rethink is
> necessary.
> > > >>
> > > >> The main issue here is how often a SessionExpired is going to
> > > >> happen, and
> > > >> why it is happening that often. Most people using ZooKeeper
> use a
> > > >> session
> > > >> timeout of 2 or 3 seconds. A SessionExpired occurs when you
> lose
> > > >> connection
> > > >> to the ZooKeeper instance you were talking to and are unable
> to
> > > >> connect to
> > > >> another one within this time frame. In HBase, we use 10
> seconds for
> > > >> this
> > > >> interval. Given that, I think we should do some recon work
> first to
> > > >> determine what's going on. When does it happen? Why? Is the
> > > >> ZooKeeper IO
> > > >> thread getting starved for long periods of time? Can we
> prevent it?
> > > >> The
> > > >> ZooKeeper folks describe SessionExpired as a very, very rare
> event,
> > > >> yet that
> > > >> does not seem to be the case for us.
> > > >>
> > > >> Issues like HBASE-1314 are certainly a bug. If we think a
> node is
> > > >> dead
> > > >> because its ephemeral ZNode has vanished we should not try
> talking
> > > >> to it
> > > >> anymore. We cannot have a case where we both think it's dead
> and are
> > > >> talking
> > > >> to it.
> > > >>
> > > >> If, after some investigation, we come to the conclusion that
> these
> > > >> SessionExpired events are unavoidable things that will happen
> quite
> > > >> frequently, then yes I think something like what you suggest
> is a
> > > >> good idea.
> > > >> But if these events only really do happen once in a blue moon
> as it
> > > >> seems
> > > >> they're supposed to, then perhaps simply internally
> restarting the
> > > >> node in
> > > >> question is not so bad?
> > > >>
> > > >> Within the solutions you propose I would opt for the timer
> option. I
> > > >> don't
> > > >> think that not using ephemeral nodes with watches is a good
> > > >> solution. It
> > > >> shifts us away from using the power that ZooKeeper provides.
> > > >> Assuming at
> > > >> some point ZooKeeper gets more reliable with its sessions, we
> will
> > > >> have a
> > > >> lot of code to change if we want to undo the decision.
> > > >>
> > > >> Regardless of what we end up going with, we need to do
> _something_
> > > >> on the
> > > >> RS/master when they get a SessionExpired, because we
> currently will
> > > >> get
> > > >> wedged. That's what I'm working on right now (HBASE-1311,
> > > >> HBASE-1312).
> > > >>
> > > >> Thanks for bringing this up Andrew. I'm glad we have a
> cluster like
> > > >> yours to
> > > >> bring out these sorts of problems. I look forward to further
> > > >> discussion on
> > > >> this topic and hearing other people's thoughts.
> > > >>
> > > >> Cheers,
> > > >> -n
> > > >>
> > > >> On Tue, Apr 7, 2009 at 11:10 AM, Andrew Purtell
> > > >> <apurtell@apache.org> wrote:
> > > >>
> > > >> >
> > > >> > Hi Chad,
> > > >> >
> > > >> > In my testing the session expiration happens due to missed
> IO
> > > >> > as with ZOOKEEPER-344, which is currently open.
> > > >> >
> > > >> >  https://issues.apache.org/jira/browse/ZOOKEEPER-344
> > > >> >
> > > >> > Also a Google search for "zookeeper session expired" turns
> up
> > > >> > some conversation already on the topic.
> > > >> >
> > > >> >  - Andy
> > > >> >
> > > >> >
> > > >> > > From: Chad Walters
> > > >> > > Subject: RE: ZK rethink?
> > > >> > > To: "hbase-dev@hadoop.apache.org" <hbase-dev@hadoop.apache.org>
> > > >> > > Date: Tuesday, April 7, 2009, 10:57 AM
> > > >> > >
> > > >> > > Has this been discussed at all with the ZooKeeper
> > > >> > > developers?
> > > >> > >
> > > >> > > Chad
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Andrew Purtell [mailto:apurtell@apache.org]
> > > >> > > Sent: Tuesday, April 07, 2009 10:53 AM
> > > >> > > To: hbase-dev@hadoop.apache.org
> > > >> > > Subject: ZK rethink?
> > > >> > >
> > > >> > >
> > > >> > > I think an assumption about ZK has been made that is
> wrong:
> > > >> > > The assumption is that ZK sessions are reliable, so
> taking
> > > >> > > immediate action from a watcher when an ephemeral node
> goes
> > > >> > > away is safe, but ZK sessions can expire for a number of
> > > >> > > reasons not related to the process holding the handle
> going
> > > >> > > away. So serious issues like HBASE-1314 result.
> > > >> > >
> > > >> > > Some problems related to session expiration can be easily
> > > >> > > handled by having the ZK wrapper reinitialize the ZK
> handle
> > > >> > > and recreate ephemeral nodes when it is informed that its
> > > >> > > session has expired. However the problem with watchers
> > > >> > > seeing deletions and taking (inappropriate) action
> remains.
> > > >> > > In my opinion, every place in the code where watchers on
> > > >> > > znodes are used to determine the state of something needs
> > > >> > > to be reworked.
> > > >> > >
> > > >> > > One option is to start a timer when a znode disappears
> and
> > > >> > > watch for its reappearance while the timer is running. If
> > > >> > > the timer expires without reappearance, then take action.
> > > >> > >
> > > >> > > Another option is to not use ephemeral nodes. Have the
> > > >> > > readers discover their znodes of interest and then poll
> > > >> > > them. Include timestamps in the stored data to determine
> > > >> > > freshness. Declare a node expired beyond some delta
> between
> > > >> > > last update and current time, and then take action. (The
> > > >> > > poller can delete the znode also to clean up.)
> > > >> > >
> > > >> > >    - Andy
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >
> > >
> >

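The two options Andy proposes at the bottom of the thread (start a
timer when a znode disappears, or poll timestamped znodes for
freshness) can be sketched roughly as below. This is only an
illustration of the bookkeeping, not HBase or ZooKeeper code: the
class name ZnodeLiveness, its methods, and the example path are all
hypothetical, the caller is assumed to feed in deletion/creation
events from its own watchers, and time is passed in explicitly so the
logic can be exercised without a real clock.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of HBase or ZooKeeper.
final class ZnodeLiveness {
    private final long graceMillis;
    // Path -> time at which the znode was first seen missing.
    private final Map<String, Long> missingSince = new HashMap<>();

    ZnodeLiveness(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Timer option: record the first time a znode is seen to vanish.
    // Repeated deletion events do not restart the grace period.
    void nodeDeleted(String path, long nowMillis) {
        missingSince.putIfAbsent(path, nowMillis);
    }

    // The znode reappeared (for example, the owner re-registered after
    // a session expiration), so cancel the pending verdict.
    void nodeCreated(String path) {
        missingSince.remove(path);
    }

    // Only treat the owner as dead once the grace period has elapsed
    // without the znode reappearing.
    boolean isDead(String path, long nowMillis) {
        Long since = missingSince.get(path);
        return since != null && nowMillis - since >= graceMillis;
    }

    // Polling option: a persistent znode stores its last update time;
    // the reader declares it expired once that timestamp is too old.
    static boolean isStale(long lastUpdateMillis, long nowMillis, long maxAgeMillis) {
        return nowMillis - lastUpdateMillis > maxAgeMillis;
    }

    public static void main(String[] args) {
        ZnodeLiveness tracker = new ZnodeLiveness(30_000L);
        String path = "/example/rs/host1"; // hypothetical znode path
        tracker.nodeDeleted(path, 0L);
        System.out.println("dead at 10s: " + tracker.isDead(path, 10_000L));
        System.out.println("dead at 30s: " + tracker.isDead(path, 30_000L));
        tracker.nodeCreated(path);
        System.out.println("dead after reappearing: " + tracker.isDead(path, 60_000L));
    }
}
```

In both variants the grace period trades recovery speed for fewer
false positives, which is the same trade-off as the 10, 30, and 60
second session timeouts discussed above.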