From: Dejan Menges <dejan.menges@gmail.com>
Date: Fri, 20 Mar 2015 14:38:48 +0000
Subject: Re: Strange issue when DataNode goes down
To: user@hbase.apache.org

With the client, the issue was that it kept retrying connections to the same region servers
even when the regions had already been reassigned. Lowering it helped in this specific use
case, but yes, we still want the servers to reallocate regions quickly.

What got me here: http://hbase.apache.org/book.html#mttr

I basically set the configuration exactly the same way as it's explained there (a sketch of
the relevant setting follows the scenario below). zookeeper.session.timeout is (and was
before) 60000 (one minute).

So basically what happens is:

- I have a test cluster of four nodes, every node being both a DataNode and a RegionServer.
- I simulate a network partition on one of them (connect to it through the console and take
  the network interface down, simulating the network issues we had recently).
- After a short time I see in HBase that my RegionServer is dead, and the total number of
  regions drops by the number of regions hosted on the node whose RegionServer just
  'disappeared'.
- At this point I want my cluster to recover as soon as possible and start serving the
  missing regions.
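For reference, this is roughly what the relevant part of our hbase-site.xml looks like for
the value quoted above (just a sketch; the other MTTR-related settings from the book section
are not reproduced here):

  <!-- hbase-site.xml (sketch) -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value> <!-- one minute, the value mentioned above -->
  </property>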
The first thing I see in the HMaster logs is this:

2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]
2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]
2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623
2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:31,183 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:36,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:42,185 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:48,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
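Since Nicolas notes below that the HDFS stale mode must always be used, with a lower timeout
than the HBase one, for completeness these are the hdfs-site.xml properties involved (a
sketch with illustrative values; I'm not claiming these are exactly what our cluster runs):

  <!-- hdfs-site.xml (sketch): mark a DataNode stale after 30s of missed heartbeats and
       steer reads/writes away from it, so recovery work is not stuck waiting on the dead node -->
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>30000</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>
  </property>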
In the meantime, in the hbase .out log I got this:

==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
        at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
        at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)

Besides this same issue, please note that the first message was at 2015-03-20 14:17:26,015.
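The waitTime=60044 / rpcTimeout=60000 in that trace lines up with the default
hbase.rpc.timeout of 60000 ms. For the client-side tuning mentioned at the top (fewer
retries so the client gives up on a dead RegionServer and re-looks-up region locations
sooner), a minimal sketch of how those knobs can be overridden programmatically with the
0.98-era client API; the table name, row key, and values here are made up, not our real
settings:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HConnection;
  import org.apache.hadoop.hbase.client.HConnectionManager;
  import org.apache.hadoop.hbase.client.HTableInterface;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ClientTimeoutSketch {
      public static void main(String[] args) throws IOException {
          // Sketch only: client-side overrides; values are examples, not recommendations
          Configuration conf = HBaseConfiguration.create();
          conf.setInt("hbase.rpc.timeout", 30000);        // per-RPC timeout, default 60000 ms
          conf.setInt("hbase.client.retries.number", 3);  // fewer retries -> client fails and re-checks region location sooner

          HConnection connection = HConnectionManager.createConnection(conf);
          HTableInterface table = connection.getTable("my_table");   // hypothetical table name
          try {
              table.get(new Get(Bytes.toBytes("some_row")));          // hypothetical row key
          } finally {
              table.close();
              connection.close();
          }
      }
  }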
And then (we got to the point where it started the transition):

2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261

At this point, note that it finished the SplitLogManager task at 14:32:35 and started
transitioning regions just after that. So these are the 15 minutes I'm talking about. What
am I missing?

On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon wrote:

> You've changed the value of hbase.zookeeper.timeout to 15 minutes? A very
> reasonable target is 1 minute before relocating the regions. That's the
> default iirc. You can push it to 20s, but then gc-stopping-the-world
> becomes more of an issue. 15 minutes is really a lot. The hdfs stale mode
> must always be used, with a lower timeout than the hbase one.
>
> Client side there should be nothing to do (except for advanced stuff); at
> each retry the client checks the location of the regions. If you lower the
> number of retries the client will fail sooner, but usually you don't want
> the client to fail, you want the servers to reallocate quickly.
>
> On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges wrote:
>
> > Hi,
> >
> > Sorry for the slightly late update, but I managed to narrow it down a bit.
> >
> > We didn't upgrade yet, as we are using the Hortonworks distribution right
> > now, and even if we upgrade we will get 0.98.4. However, it looks like the
> > issue here was in our use case and configuration (still looking into it).
> >
> > Basically, initially I saw that when one server goes down we start having
> > performance issues in general, but it turned out to be on our client side,
> > due to caching: clients were trying to reconnect to nodes that were offline
> > and later trying to get regions from those nodes too.
> > This is basically why, on the server side, I didn't manage to see anything
> > in the logs that was even slightly interesting or pointed me in the desired
> > direction.
> >
> > Another question that popped up for me: in case a server is down (and with
> > it the DataNode and the HRegionServer it was hosting), what is the optimal
> > time to set for the HMaster to consider the server dead and reassign its
> > regions somewhere else? This is another performance bottleneck we hit while
> > being unable to access those regions. In our case it's configured to 15
> > minutes, and simple logic tells me that if you want it sooner you configure
> > a lower number of retries, but the issue is, as always, in the details, so
> > I'm not sure if anyone knows better math for this?
> >
> > And the last question - is it possible to manually force HBase to reassign
> > regions? In this case, while the HMaster is retrying to contact the node
> > that's dead, it's impossible to force it using the 'balancer' command.
> >
> > Thanks a lot!
> >
> > Dejan
> >
> > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges wrote:
> >
> > > Hi,
> > >
> > > To be very honest - there's no particular reason why we stick to this
> > > one, besides a lack of time at the moment to go through the upgrade
> > > process, but it looks to me like that's going to be the next step.
> > >
> > > Had a crazy day, didn't have time to go through all the logs again, plus
> > > one of the machines (the last one where we had this issue) was fully
> > > reprovisioned yesterday, so I don't have logs from there anymore.
> > >
> > > Besides the upgrade, which I will discuss today, can you just point me to
> > > the specific RPC issue in 0.98.0? The thing is that we have some strange
> > > moments with RPC in this case, and I just want to see if it's the same
> > > thing (we were even suspecting RPC).
> > >
> > > Thanks a lot!
> > > Dejan
> > >
> > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell wrote:
> > >
> > >> Is there a particular reason why you are using HBase 0.98.0? The latest
> > >> 0.98 release is 0.98.11. There's a known performance issue with 0.98.0
> > >> pertaining to RPC that was fixed in later releases; you should move up
> > >> from 0.98.0. In addition, hundreds of improvements and bug fixes have
> > >> gone into the ten releases since 0.98.0.
> > >>
> > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges wrote:
> > >>
> > >> > Hi All,
> > >> >
> > >> > We have a strange issue with HBase performance (overall cluster
> > >> > performance) in case one of the datanodes in the cluster unexpectedly
> > >> > goes down.
> > >> >
> > >> > The scenario is as follows:
> > >> > - The cluster works fine, it's stable.
> > >> > - One DataNode unexpectedly goes down (PSU issue, network issue, anything).
> > >> > - The whole HBase cluster goes down (performance becomes so bad that we
> > >> > have to restart all RegionServers to get it back to life).
> > >> >
> > >> > The funniest and most recent occurrence was when we added a new node to
> > >> > the cluster (with 8 x 4T SATA disks) and left just the DataNode running
> > >> > on it to give it a couple of days to collect some data. At some point,
> > >> > due to a hardware issue, the server rebooted (twice during three hours)
> > >> > at a moment when it had maybe 5% of the data it would have had in a
> > >> > couple of days. Nothing else besides the DataNode was running, and once
> > >> > it went down, it affected literally everything, and restarting the
> > >> > RegionServers in the end fixed it.
> > >> >
> > >> > We are using HBase 0.98.0 with Hadoop 2.4.0.
> > >>
> > >> --
> > >> Best regards,
> > >>
> > >>    - Andy
> > >>
> > >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >> (via Tom White)