hbase-dev mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: ZK rethink?
Date Wed, 08 Apr 2009 01:10:19 GMT
+1 on the C API wrapper. Great idea. I think this should even reside
in a ZK contrib.

J-D

On Tue, Apr 7, 2009 at 8:01 PM, Jim Kellerman (POWERSET)
<Jim.Kellerman@microsoft.com> wrote:
> I don't think that's a bad idea. If we set a (configurable) limit on
> the block cache, we can let it grow to that point and then start
> reusing blocks on an LRU basis.
>
> If the blocks get really stale, we might consider giving them back.
>
> ---
> Jim Kellerman, Powerset (Live Search, Microsoft Corporation)
>
>
>> -----Original Message-----
>> From: Ryan Rawson [mailto:ryanobjc@gmail.com]
>> Sent: Tuesday, April 07, 2009 4:37 PM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: ZK rethink?
>>
>> Again, the GC bites us in the ass.
>>
>> We'll have to worry about heap fragmentation and reusing blocks
>> when we are
>> doing a block cache.  Otherwise we'll end up sucking up ram that is
>> dead,
>> waiting for a full-gc to come along and help us.
>>
>> we might want to think about doing some kind of block reuse in the
>> LRU cache
>> for hbase.  it could get ugly fast here, since we'd essentially be
>> writing
>> memory allocation code. But it might be necessary to avoid GC
>> problems in a
>> big heap.
>>
>>
>>
>> On Tue, Apr 7, 2009 at 4:25 PM, Jim Kellerman (POWERSET) <
>> Jim.Kellerman@microsoft.com> wrote:
>>
>> > Ryan,
>> >
>> > Good idea! Because JVM thread scheduling is not fair, nor does
>> > it pay attention to thread priorities, Java is highly susceptible
>> > to thread starvation which can lead to HRS not being able to
>> report
>> > to the master (zookeeper) prior to the lease timeout.
>> >
>> > ---
>> > Jim Kellerman, Powerset (Live Search, Microsoft Corporation)
>> >
>> >
>> > > -----Original Message-----
>> > > From: Ryan Rawson [mailto:ryanobjc@gmail.com]
>> > > Sent: Tuesday, April 07, 2009 1:54 PM
>> > > To: hbase-dev@hadoop.apache.org
>> > > Cc: joey42+reply@gmail.com <joey42%2Breply@gmail.com>
>> > > Subject: Re: ZK rethink?
>> > >
>> > > Thanks for the input Joey, and may I be the first to say "holy
>> > > shit".
>> > >
>> > > The reason their approach works is because the C API spins off
>> OS
>> > > threads
>> > > that exist outside the domain of the Java VM, which means those
>> > > threads
>> > > never get paused for GC processing.
>> > >
>> > > With that kind of input, we might want to consider doing what he
>> > > did.  Maybe
>> > > you can donate a bit of code?
>> > >
>> > > Thanks!
>> > > -ryan
>> > >
>> > > On Tue, Apr 7, 2009 at 1:49 PM, Nitay <nitayj@gmail.com> wrote:
>> > >
>> > > > Very interesting Joey. Thanks for replying with this
>> information.
>> > > Also,
>> > > > welcome! :).
>> > > >
>> > > > I don't quite understand why the C API with JNI fixes the
>> problem.
>> > > Did that
>> > > > substantially reduce your tiny, short lived objects to the
>> point
>> > > where the
>> > > > GC wasn't starving the ZooKeeper IO threads anymore?
>> > > >
>> > > > Perhaps my initial 10 second value was not enough. Andrew, can
>> you
>> > > try 30
>> > > > or
>> > > > 60 seconds as a test on your cluster to see if that calms
>> things
>> > > down?
>> > > >
>> > > > -n
>> > > >
>> > > > On Tue, Apr 7, 2009 at 1:43 PM, Joey Echeverria
>> <joey42@gmail.com>
>> > > wrote:
>> > > >
>> > > > > Long time lurker, first time poster.
>> > > > >
>> > > > > We've used zookeeper in a write-heavy project we've been
>> working
>> > > on
>> > > > > and experienced issues similar to what you described. After
>> > > several
>> > > > > days of debugging, we discovered that our issue was garbage
>> > > > > collection. There was no way to guarantee we wouldn't have
>> long
>> > > pauses
>> > > > > especially since our environment was the worst case for
>> garbage
>> > > > > collection, millions of tiny, short lived objects. I suspect
>> > > HBase
>> > > > > sees similar work loads frequently, if it's not constantly.
>> With
>> > > > > anything shorter than a 30 second session time out, we got
>> > > session
>> > > > > expiration events extremely frequently. We needed to use 60
>> > > seconds
>> > > > > for any real confidence that an ephemeral node disappearing
>> > > meant
>> > > > > something was unavailable.
>> > > > >
>> > > > > We really wanted quick recovery so we ended up writing a
>> light-
>> > > weight
>> > > > > wrapper around the C API and used swig to auto-generate a
>> JNI
>> > > > > interface. It's not perfect, but since we switched to this
>> > > method
>> > > > > we've never seen a session expiration event and ephemeral
>> nodes
>> > > only
>> > > > > disappear when there are network issues or a machine/process
>> > > goes
>> > > > > down.
>> > > > >
>> > > > > I don't know if it's worth doing the same kind of thing for
>> > > HBase as
>> > > > > it adds some "unnecessary" native code, but it's a solution
>> that
>> > > I
>> > > > > found works.
>> > > > >
>> > > > > On Tue, Apr 7, 2009 at 9:28 PM, Jim Kellerman (POWERSET)
>> > > > > <Jim.Kellerman@microsoft.com> wrote:
>> > > > > > There are a number of reasons why Zookeeper could receive
>> a
>> > > > > SessionExpired
>> > > > > > event:
>> > > > > > - The process died
>> > > > > > - The machine died
>> > > > > > - There is/was a network partition
>> > > > > > - The network is flapping
>> > > > > >
>> > > > > > This is why the lease timeout is set to 2 minutes by
>> default.
>> > > If things
>> > > > > > haven't recovered in two minutes, we assume that the
>> region
>> > > server is
>> > > > > > dead, hung or in any event, unresponsive. Maybe we should
>> add
>> > > an API
>> > > > > > to the region server such that the Master (or Zookeeper)
>> could
>> > > call it
>> > > > > > and ask if it is still alive, before starting region
>> server
>> > > recovery
>> > > > > > (ProcessServerShutdown).
>> > > > > >
>> > > > > > ---
>> > > > > > Jim Kellerman, Powerset (Live Search, Microsoft
>> Corporation)
>> > > > > >
>> > > > > >
>> > > > > >> -----Original Message-----
>> > > > > >> From: Nitay [mailto:nitayj@gmail.com]
>> > > > > >> Sent: Tuesday, April 07, 2009 1:13 PM
>> > > > > >> To: hbase-dev@hadoop.apache.org; apurtell@apache.org
>> > > > > >> Subject: Re: ZK rethink?
>> > > > > >>
>> > > > > >> Hi Andrew,
>> > > > > >>
>> > > > > >> I agree with you that getting a SessionExpired is a problem for us,
>> > > > > >> and we didn't really consider it when we initially put in the
>> > > > > >> ZooKeeper code. However, I don't necessarily think a complete
>> > > > > >> rethink is necessary.
>> > > > > >>
>> > > > > >> The main issue here is how often a SessionExpired is going to
>> > > > > >> happen, and why it is happening that often. Most people using
>> > > > > >> ZooKeeper use a session timeout of 2 or 3 seconds. A SessionExpired
>> > > > > >> occurs when you lose connection to the ZooKeeper instance you were
>> > > > > >> talking to and are unable to connect to another one within this
>> > > > > >> time frame. In HBase, we use 10 seconds for this interval. Given
>> > > > > >> that, I think we should do some recon work first to determine
>> > > > > >> what's going on. When does it happen? Why? Is the ZooKeeper IO
>> > > > > >> thread getting starved for long periods of time? Can we prevent it?
>> > > > > >> The ZooKeeper folks describe SessionExpired as a very, very rare
>> > > > > >> event, yet that does not seem to be the case for us.
>> > > > > >>
>> > > > > >> Issues like HBASE-1314 are certainly bugs. If we think a node is
>> > > > > >> dead because its ephemeral ZNode has vanished we should not try
>> > > > > >> talking to it anymore. We cannot have a case where we both think
>> > > > > >> it's dead and are talking to it.
>> > > > > >>
>> > > > > >> If, after some investigation, we come to the conclusion that these
>> > > > > >> SessionExpired events are unavoidable things that will happen quite
>> > > > > >> frequently, then yes I think something like what you suggest is a
>> > > > > >> good idea. But if these events only really do happen once in a blue
>> > > > > >> moon as it seems they're supposed to, then perhaps simply
>> > > > > >> internally restarting the node in question is not so bad?
>> > > > > >>
>> > > > > >> Within the solutions you propose I would opt for the timer option.
>> > > > > >> I don't think that not using ephemeral nodes with watches is a good
>> > > > > >> solution. It shifts us away from using the power that ZooKeeper
>> > > > > >> provides. Assuming at some point ZooKeeper gets more reliable with
>> > > > > >> its sessions, we will have a lot of code to change if we want to
>> > > > > >> undo the decision.
>> > > > > >>
>> > > > > >> Regardless of what we end up going with, we need to do _something_
>> > > > > >> on the RS/master when they get a SessionExpired, because we
>> > > > > >> currently will get wedged. That's what I'm working on right now
>> > > > > >> (HBASE-1311, HBASE-1312).
>> > > > > >>
>> > > > > >> Thanks for bringing this up Andrew. I'm glad we have a cluster like
>> > > > > >> yours to bring out these sorts of problems. I look forward to
>> > > > > >> further discussion on this topic and hearing other people's
>> > > > > >> thoughts.
>> > > > > >>
>> > > > > >> Cheers,
>> > > > > >> -n
>> > > > > >>
>> > > > > >> On Tue, Apr 7, 2009 at 11:10 AM, Andrew Purtell
>> > > > > >> <apurtell@apache.org> wrote:
>> > > > > >>
>> > > > > >> >
>> > > > > >> > Hi Chad,
>> > > > > >> >
>> > > > > >> > In my testing the session expiration happens due to missed IO,
>> > > > > >> > as with ZOOKEEPER-344, which is currently open.
>> > > > > >> >
>> > > > > >> >  https://issues.apache.org/jira/browse/ZOOKEEPER-344
>> > > > > >> >
>> > > > > >> > Also a Google search for "zookeeper session expired"
>> turns
>> > > up
>> > > > > >> > some conversation already on the topic.
>> > > > > >> >
>> > > > > >> >  - Andy
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > > From: Chad Walters
>> > > > > >> > > Subject: RE: ZK rethink?
>> > > > > >> > > To: "hbase-dev@hadoop.apache.org" <hbase-dev@hadoop.apache.org>
>> > > > > >> > > Date: Tuesday, April 7, 2009, 10:57 AM
>> > > > > >> > >
>> > > > > >> > > Has this been discussed at all with the ZooKeeper
>> > > > > >> > > developers?
>> > > > > >> > >
>> > > > > >> > > Chad
>> > > > > >> > >
>> > > > > >> > > -----Original Message-----
>> > > > > >> > > From: Andrew Purtell [mailto:apurtell@apache.org]
>> > > > > >> > > Sent: Tuesday, April 07, 2009 10:53 AM
>> > > > > >> > > To: hbase-dev@hadoop.apache.org
>> > > > > >> > > Subject: ZK rethink?
>> > > > > >> > >
>> > > > > >> > >
>> > > > > >> > > I think an assumption about ZK has been made that is wrong:
>> > > > > >> > > The assumption is that ZK sessions are reliable, so taking
>> > > > > >> > > immediate action from a watcher when an ephemeral node goes
>> > > > > >> > > away is safe, but ZK sessions can expire for a number of
>> > > > > >> > > reasons not related to the process holding the handle going
>> > > > > >> > > away. So serious issues like HBASE-1314 result.
>> > > > > >> > >
>> > > > > >> > > Some problems related to session expiration can be easily
>> > > > > >> > > handled by having the ZK wrapper reinitialize the ZK handle
>> > > > > >> > > and recreate ephemeral nodes when it is informed that its
>> > > > > >> > > session has expired. However the problem with watchers
>> > > > > >> > > seeing deletions and taking (inappropriate) action remains.
>> > > > > >> > > In my opinion, every place in the code where watchers on
>> > > > > >> > > znodes are used to determine the state of something needs
>> > > > > >> > > to be reworked.
>> > > > > >> > >
>> > > > > >> > > One option is to start a timer when a znode disappears and
>> > > > > >> > > watch for its reappearance while the timer is running. If
>> > > > > >> > > the timer expires without reappearance, then take action.
>> > > > > >> > >
>> > > > > >> > > Another option is to not use ephemeral nodes. Have the
>> > > > > >> > > readers discover their znodes of interest and then poll
>> > > > > >> > > them. Include timestamps in the stored data to determine
>> > > > > >> > > freshness. Declare a node expired beyond some delta between
>> > > > > >> > > last update and current time, and then take action. (The
>> > > > > >> > > poller can delete the znode also to clean up.)
>> > > > > >> > >
>> > > > > >> > >    - Andy
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >> >
>> > > > > >
>> > > > >
>> > > >
>> >
>
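
For reference, the timer option from Andrew's original message (start a
timer when a znode disappears, and act only if it has not reappeared by
the time the timer runs out) might be sketched roughly as below. This is
an illustration under assumptions, not code from HBase or ZooKeeper: the
class ZnodeGraceTracker, its method names, and the idea of passing the
current time in explicitly are all hypothetical. The caller is assumed
to forward watcher events and to poll drainExpired() periodically.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; not part of HBase or ZooKeeper. */
class ZnodeGraceTracker {
    private final long graceMillis;
    // znode path -> time (ms) at which its deletion was first observed
    private final Map<String, Long> missingSince = new LinkedHashMap<>();

    ZnodeGraceTracker(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Call from the watcher when an ephemeral znode disappears. */
    synchronized void onDeleted(String path, long nowMillis) {
        // Keep the earliest deletion time if the event is seen twice.
        missingSince.putIfAbsent(path, nowMillis);
    }

    /** Call when the znode reappears; this cancels the pending timer. */
    synchronized void onCreated(String path) {
        missingSince.remove(path);
    }

    /** Returns, and forgets, every path missing for at least the grace period. */
    synchronized List<String> drainExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = missingSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= graceMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

Only the paths returned by drainExpired() would be handed to recovery,
so a region server whose session expired but which re-created its znode
within the grace period is never treated as dead. The price is that a
real failure is detected one grace period later.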
