hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dejan Menges <dejan.men...@gmail.com>
Subject Re: Strange issue when DataNode goes down
Date Mon, 23 Mar 2015 09:03:58 GMT
Hi Nicolas,

Please find log attached.

As I see it now more clearly, it was trying to recover HDFS WALs from node
that's dead:

2015-03-23 08:53:44,381 WARN org.apache.hadoop.hbase.util.FSHDFSUtils:
Cannot recoverLease after trying for 900000ms
(hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!;
attempt=40 on
file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intentionally_get_down_by_getting_network_down},60020,1426862900506-splitting/{node_i_intentionally_get_down_by_getting_network_down}%2C60020%2C1426862900506.1427096924508
after 908210ms

And as you can see from the log, it tried 40 times, what took it exactly 15
minutes.

There's probably some parameter to tune to cut it of from 40 times / 15
minutes to something more useful, as for 15 minutes we don't have our
regions available, and HDFS have however replication factor 3.

Googling, if I figure out what's this I will post it here. Will also
appreciate if someone knows how to cut this down.

Thanks,

Dejan

On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <nkeywal@gmail.com> wrote:

> The split is done by the region servers (the master coordinates). Is there
> some interesting stuff in their logs?
>
> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <dejan.menges@gmail.com>
> wrote:
>
> > With client issue was that it was retrying connecting to the same region
> > servers even when they were reassigned. Lowering it down helped in this
> > specific use case, but yes, we still want servers to reallocate quickly.
> >
> > What got me here:
> >
> > http://hbase.apache.org/book.html#mttr
> >
> > I basically set configuration exactly the same way as it's explained
> here.
> > *zookeeper.session.timeout* is (and was before) 60000 (one minute).
> >
> > So basically what happens is: - simulating network issues we had
> recently).
> > - After short time I see in HBase that my RegionServer is dead, and as
> > total number of regions I see previous total minus number of regions that
> > were hosted on the node hosting RegionServer that just 'disappeared'.
> > - In this point I want my clus
> >
> > - I have test cluster consisting of four nodes, every node being DataNode
> > and RegionServer.
> > - I simulate network partition on one (connect to it through console and
> > take network interface downter to recover as soon as possible, to start
> > serving missing regions.
> > - First thing I see in HMaster logs are:
> >
> > 2015-03-20 14:17:26,015 INFO
> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
> > ephemeral node deleted, processing expiration
> > [{name_of_node_I_took_down},60020,1426860403261]
> >
> > 2015-03-20 14:17:26,067 INFO
> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
> > logs for {name_of_node_I_took_down},60020,1426860403261 before
> assignment.
> >
> > 2015-03-20 14:17:26,105 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [
> > {name_of_node_I_took_down},60020,1426860403261]
> >
> > 2015-03-20 14:17:26,107 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1
> logs in
> > [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down}
> > ,60020,1426860403261-splitting]
> >
> > 2015-03-20 14:17:26,150 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > /hbase/splitWAL/WALs%2F
> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > {name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
> > acquired by {fqdn_of_another_live_hnode},60020,1426859445623
> >
> > 2015-03-20 14:17:26,182 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1
> unassigned
> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> >
> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_
> took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > = 1426861046182 last_version = 2 cur_worker_name =
> > {fqdn_of_another_live_node},60020,1426859445623 status = in_progress
> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> >
> > 2015-03-20 14:17:31,183 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1
> unassigned
> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> >
> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_
> took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > = 1426861046182 last_version = 2 cur_worker_name =
> > {fqdn_of_another_live_node},60020,1426859445623 status = in_progress
> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> >
> > 2015-03-20 14:17:36,184 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1
> unassigned
> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> >
> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_
> took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > = 1426861046182 last_version = 2 cur_worker_name =
> > {fqdn_of_another_live_node},60020,1426859445623 status = in_progress
> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> >
> > 2015-03-20 14:17:42,185 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1
> unassigned
> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> >
> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_
> took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > = 1426861046182 last_version = 2 cur_worker_name =
> > {fqdn_of_another_live_node},60020,1426859445623 status = in_progress
> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> >
> > 2015-03-20 14:17:48,184 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1
> unassigned
> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> >
> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_
> took_down}%252C60020%252C1426860403261.1426860404905=last_update
> > = 1426861046182 last_version = 2 cur_worker_name =
> > {fqdn_of_another_live_node},60020,1426859445623 status = in_progress
> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
> > In the meantime, In hbase...out log I got this:
> >
> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
> >
> > java.io.IOException: Call to
> > {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020
> > failed on local exception:
> > org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call
> > id=93152, waitTime=60044, rpcTimeout=60000
> >
> > at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(
> RpcClient.java:1532)
> >
> > at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
> >
> > at
> >
> > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(
> RpcClient.java:1684)
> >
> > at
> >
> > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementati
> on.callBlockingMethod(RpcClient.java:1737)
> >
> > at
> >
> > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$
> BlockingStub.getRegionInfo(AdminProtos.java:20806)
> >
> > at
> >
> > org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.
> java:2524)
> >
> > at
> >
> > org.apache.hadoop.hbase.generated.master.table_jsp._
> jspService(table_jsp.java:167)
> >
> > at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
> >
> > at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
> >
> > at org.mortbay.jetty.servlet.ServletHolder.handle(
> ServletHolder.java:511)
> >
> > at
> >
> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1221)
> >
> > at
> >
> > org.apache.hadoop.http.lib.StaticUserWebFilter$
> StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
> >
> > at
> >
> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1212)
> >
> > at
> >
> > org.apache.hadoop.http.HttpServer$QuotingInputFilter.
> doFilter(HttpServer.java:1081)
> >
> > at
> >
> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1212)
> >
> > at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> >
> > at
> >
> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.
> doFilter(ServletHandler.java:1212)
> >
> > at org.mortbay.jetty.servlet.ServletHandler.handle(
> ServletHandler.java:399)
> >
> > at
> > org.mortbay.jetty.security.SecurityHandler.handle(
> SecurityHandler.java:216)
> >
> > at org.mortbay.jetty.servlet.SessionHandler.handle(
> SessionHandler.java:182)
> >
> > at org.mortbay.jetty.handler.ContextHandler.handle(
> ContextHandler.java:766)
> >
> > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> >
> > at
> >
> > org.mortbay.jetty.handler.ContextHandlerCollection.handle(
> ContextHandlerCollection.java:230)
> >
> > at org.mortbay.jetty.handler.HandlerWrapper.handle(
> HandlerWrapper.java:152)
> >
> > at org.mortbay.jetty.Server.handle(Server.java:326)
> >
> > at org.mortbay.jetty.HttpConnection.handleRequest(
> HttpConnection.java:542)
> >
> > at
> >
> > org.mortbay.jetty.HttpConnection$RequestHandler.
> headerComplete(HttpConnection.java:928)
> >
> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> >
> > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> >
> > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> >
> > at
> >
> > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:
> 410)
> >
> > at
> >
> > org.mortbay.thread.QueuedThreadPool$PoolThread.
> run(QueuedThreadPool.java:582)
> >
> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException:
> Call
> > id=93152, waitTime=60044, rpcTimeout=60000
> >
> > at
> >
> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.
> cleanupCalls(RpcClient.java:1234)
> >
> > at
> >
> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.
> readResponse(RpcClient.java:1171)
> >
> > at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(
> RpcClient.java:751)
> > Beside this same issue, please note that first message was at 2015-03-20
> > 14:17:26,015. And then (we got to the point when it started transition):
> >
> > 2015-03-20 14:32:35,059 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > /hbase/splitWAL/WALs%2F
> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > {name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
> > entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
> >
> > 2015-03-20 14:32:35,109 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: Done splitting
> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}
> > %252C60020%252C1426860403261.1426860404905
> >
> > 2015-03-20 14:32:35,190 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more
> > than or equal to) 9 bytes in 1 log files in
> >
> > [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_
> of_node_I_took_down},60020,1426860403261-splitting]
> > in 909083ms
> >
> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.
> master.RegionStates:
> > Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN,
> > ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261}
> to
> > {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191,
> server=
> > {name_of_node_I_took_down},60020,1426860403261}
> >
> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.
> master.RegionStates:
> > Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from
> {name_of_node_I_took_down}
> > ,60020,1426860403261
> >
> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.
> master.RegionStates:
> > Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN,
> > ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261}
> to
> > {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191,
> server=
> > {name_of_node_I_took_down},60020,1426860403261}
> >
> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.
> master.RegionStates:
> > Offlined 25ab6e7b42e36ddaa723d71be5954543 from
> {name_of_node_I_took_down}
> > ,60020,1426860403261
> > At this point, note that it finished SplitLogManager task at 14:32:35 and
> > started transitioning just after that. So this is 15 minutes that I'm
> > talking about.
> >
> > What am I missing?
> >
> >
> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <nkeywal@gmail.com>
> wrote:
> >
> > > You've changed the value of hbase.zookeeper.timeout to 15 minutes? A
> very
> > > reasonable target is 1 minute before relocating the regions. That's the
> > > default iirc. You can push it to 20s, but then gc-stopping-the-world
> > > becomes more of an issue. 15 minutes is really a lot. The hdfs stale
> mode
> > > must always be used, with a lower timeout than the hbase one.
> > >
> > > Client side there should be nothing to do (excepted for advanced
> stuff);
> > at
> > > each retry the client checks the location of the regions. If you lower
> > the
> > > number of retry the client will fail sooner, but usually you don't want
> > the
> > > client to fail, you want the servers to reallocate quickly.
> > >
> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <dejan.menges@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Sorry for little bit late update, but managed to narrow it little bit
> > > down.
> > > >
> > > > We didn't update yet, as we are using Hortonworks distribution right
> > now,
> > > > and even if we update we will get 0.98.4. However, looks that issue
> > here
> > > > was in our use case and configuration (still looking into it).
> > > >
> > > > Basically, initially I saw that when one server goes down, we start
> > > having
> > > > performance issues in general, but it managed to be on our client
> side,
> > > due
> > > > to caching, and clients were trying to reconnect to nodes that were
> > > offline
> > > > and later trying to get regions from those nodes too. This is
> basically
> > > why
> > > > on server side I didn't manage to see anything in logs that would be
> at
> > > > least little bit interesting and point me into desired direction.
> > > >
> > > > Another question that popped up to me is - in case server is down
> (and
> > > with
> > > > it DataNode and HRegionServer it was hosting) - what's optimal time
> to
> > > set
> > > > for HMaster to consider server dead reassign regions somewhere else,
> as
> > > > this is another performance bottleneck we hit during inability to
> > access
> > > > regions? In our case it's configured to 15 minutes, and simple logic
> > > tells
> > > > me if you want it earlier then configure lower number of retries, but
> > > issue
> > > > is as always in details, so not sure if anyone knows some better math
> > for
> > > > this?
> > > >
> > > > And last question - is it possible to manually force HBase to
> reassign
> > > > regions? In this case, while HMaster is retrying to contact node
> that's
> > > > dead, it's impossible to force it using 'balancer' command.
> > > >
> > > > Thanks a lot!
> > > >
> > > > Dejan
> > > >
> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <dejan.menges@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > To be very honest - there's no particular reason why we stick to
> this
> > > > one,
> > > > > beside just lack of time currently to go through upgrade process,
> but
> > > > looks
> > > > > to me that's going to be next step.
> > > > >
> > > > > Had a crazy day, didn't have time to go through all logs again,
> plus
> > > one
> > > > > of the machines (last one where we had this issue) is fully
> > > reprovisioned
> > > > > yesterday so I don't have logs from there anymore.
> > > > >
> > > > > Beside upgrading,  what I will talk about today, can you just point
> > me
> > > to
> > > > > the specific RPC issue in 0.98.0? Thing is that we have some
> strange
> > > > > moments with RPC in this case, and just want to see if that's the
> > same
> > > > > thing (and we were even suspecting to RPC).
> > > > >
> > > > > Thanks a lot!
> > > > > Dejan
> > > > >
> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell <
> apurtell@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > >> Is there a particular reason why you are using HBase 0.98.0?
The
> > > latest
> > > > >> 0.98 release is 0.98.11. There's a known performance issue with
> > 0.98.0
> > > > >> pertaining to RPC that was fixed in later releases, you should
> move
> > up
> > > > >> from
> > > > >> 0.98.0. In addition hundreds of improvements and bug fixes have
> gone
> > > > into
> > > > >> the ten releases since 0.98.0.
> > > > >>
> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges <
> > dejan.menges@gmail.com
> > > >
> > > > >> wrote:
> > > > >>
> > > > >> > Hi All,
> > > > >> >
> > > > >> > We have a strange issue with HBase performance (overall
cluster
> > > > >> > performance) in case one of datanodes in the cluster
> unexpectedly
> > > goes
> > > > >> > down.
> > > > >> >
> > > > >> > So scenario is like follows:
> > > > >> > - Cluster works fine, it's stable.
> > > > >> > - One DataNode unexpectedly goes down (PSU issue, network
issue,
> > > > >> anything)
> > > > >> > - Whole HBase cluster goes down (performance becomes so
bad that
> > we
> > > > >> have to
> > > > >> > restart all RegionServers to get it back to life).
> > > > >> >
> > > > >> > Most funny and latest issue that happened was that we added
new
> > node
> > > > to
> > > > >> the
> > > > >> > cluster (having 8 x 4T SATA disks) and we left just DataNode
> > running
> > > > on
> > > > >> it
> > > > >> > to give it couple of days to get some data. At some point
in
> time,
> > > due
> > > > >> to
> > > > >> > hardware issue, server rebooted (twice during three hours)
in
> > moment
> > > > >> when
> > > > >> > it had maybe 5% of data it would have in a couple of days.
> Nothing
> > > > else
> > > > >> > beside DataNode was running, and once it went down, it affected
> > > > literary
> > > > >> > everything, and restarting RegionServers in the end fixed
it.
> > > > >> >
> > > > >> > We are using HBase 0.98.0 with Hadoop 2.4.0
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Best regards,
> > > > >>
> > > > >>    - Andy
> > > > >>
> > > > >> Problems worthy of attack prove their worth by hitting back.
-
> Piet
> > > Hein
> > > > >> (via Tom White)
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/mixed (inline, None, 0 bytes)
View raw message