hbase-user mailing list archives

From Nicolas Liochon <nkey...@gmail.com>
Subject Re: Strange issue when DataNode goes down
Date Fri, 20 Mar 2015 13:36:20 GMT
You've changed the value of zookeeper.session.timeout to 15 minutes? A very
reasonable target is 1 minute before relocating the regions; that's the
default, iirc. You can push it down to 20s, but then stop-the-world GC
pauses become more of an issue. 15 minutes is really a lot. HDFS stale mode
must always be used, with a lower timeout than the HBase one.
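
For reference, a minimal sketch of what that could look like (standard
HBase/HDFS property names; the values are illustrative, not a
recommendation):

In hbase-site.xml:

  <property>
    <name>zookeeper.session.timeout</name>
    <!-- illustrative value: 1 minute before the master declares the
         region server dead -->
    <value>60000</value>
  </property>

In hdfs-site.xml (stale mode, with a timeout lower than the HBase one):

  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <!-- mark the DataNode stale after 30s, well under the 60s above -->
    <value>30000</value>
  </property>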

On the client side there should be nothing to do (except for advanced
stuff); at each retry the client re-checks the location of the regions. If
you lower the number of retries the client will fail sooner, but usually
you don't want the client to fail; you want the servers to reassign the
regions quickly.
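
If you do want the client to give up sooner (the "advanced stuff" case),
the usual knobs are the retry count and the retry pause in the client's
hbase-site.xml; the values below are only an example:

  <property>
    <name>hbase.client.retries.number</name>
    <!-- example value: fewer retries means the client errors out sooner -->
    <value>10</value>
  </property>
  <property>
    <name>hbase.client.pause</name>
    <!-- base pause in ms between retries; the backoff schedule
         multiplies it -->
    <value>100</value>
  </property>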

On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <dejan.menges@gmail.com>
wrote:

> Hi,
>
> Sorry for the slightly late update, but I managed to narrow it down a bit.
>
> We haven't upgraded yet, as we are using the Hortonworks distribution right
> now, and even if we upgrade we will get 0.98.4. However, it looks like the
> issue here was in our use case and configuration (still looking into it).
>
> Basically, I initially saw that when one server goes down we start having
> general performance issues, but it turned out to be on our client side: due
> to caching, clients kept trying to reconnect to nodes that were offline, and
> later tried to fetch regions from those nodes too. This is basically why, on
> the server side, I couldn't find anything in the logs that was even slightly
> interesting or pointed me in the right direction.
>
> Another question that came up: in case a server goes down (and with it the
> DataNode and HRegionServer it was hosting), what's the optimal time after
> which the HMaster should consider the server dead and reassign its regions
> somewhere else? This is another performance bottleneck we hit while the
> regions are inaccessible. In our case it's configured to 15 minutes, and
> simple logic tells me that if you want it to happen earlier you configure a
> lower number of retries, but as always the devil is in the details, so I'm
> not sure if anyone has better math for this?
>
> And a last question: is it possible to manually force HBase to reassign
> regions? In this case, while the HMaster is still retrying to contact the
> dead node, it's impossible to force it using the 'balancer' command.
>
> Thanks a lot!
>
> Dejan
>
> On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <dejan.menges@gmail.com>
> wrote:
>
> > Hi,
> >
> > To be very honest, there's no particular reason why we stick to this one,
> > beside a current lack of time to go through the upgrade process, but it
> > looks to me like that's going to be the next step.
> >
> > Had a crazy day and didn't have time to go through all the logs again;
> > plus one of the machines (the last one where we had this issue) was fully
> > reprovisioned yesterday, so I don't have the logs from there anymore.
> >
> > Besides upgrading, which I will bring up today, can you just point me to
> > the specific RPC issue in 0.98.0? The thing is that we see some strange
> > RPC behavior in this case, and I just want to check whether it's the same
> > thing (we were even suspecting RPC ourselves).
> >
> > Thanks a lot!
> > Dejan
> >
> > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <apurtell@apache.org>
> > wrote:
> >
> >> Is there a particular reason why you are using HBase 0.98.0? The latest
> >> 0.98 release is 0.98.11. There's a known performance issue with 0.98.0
> >> pertaining to RPC that was fixed in later releases; you should move up
> >> from 0.98.0. In addition, hundreds of improvements and bug fixes have
> >> gone into the ten releases since 0.98.0.
> >>
> >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <dejan.menges@gmail.com>
> >> wrote:
> >>
> >> > Hi All,
> >> >
> >> > We have a strange issue with HBase performance (overall cluster
> >> > performance) when one of the DataNodes in the cluster unexpectedly
> >> > goes down.
> >> >
> >> > So the scenario is as follows:
> >> > - The cluster works fine and is stable.
> >> > - One DataNode unexpectedly goes down (PSU issue, network issue,
> >> > anything).
> >> > - The whole HBase cluster goes down (performance becomes so bad that
> >> > we have to restart all RegionServers to bring it back to life).
> >> >
> >> > The funniest and most recent occurrence was when we added a new node
> >> > to the cluster (with 8 x 4T SATA disks) and left just the DataNode
> >> > running on it, to give it a couple of days to accumulate some data. At
> >> > some point, due to a hardware issue, the server rebooted (twice within
> >> > three hours) at a moment when it held maybe 5% of the data it would
> >> > have in a couple of days. Nothing besides the DataNode was running on
> >> > it, and once it went down it affected literally everything; restarting
> >> > the RegionServers in the end fixed it.
> >> >
> >> > We are using HBase 0.98.0 with Hadoop 2.4.0.
> >> >
> >>
> >>
> >>
> >> --
> >> Best regards,
> >>
> >>    - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
> >>
> >
> >
>
