hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dejan Menges <dejan.men...@gmail.com>
Subject Re: Strange issue when DataNode goes down
Date Mon, 23 Mar 2015 14:05:22 GMT
Will take a look.

Actually, if node is down (someone unplugged network cable, it just died,
whatever) what's chance it's going to become live so write can continue? On
the other side, HBase is not starting recovery trying to contact node which
is not there anymore, and even elected as dead on Namenode side (another
thing I didn't understood quite good).

So what I was expecting is that as soon as Namenode decided node is dead,
that it would be enough for RegionServer to stop trying to recover from the
dead node, but it wasn't the case. Also, this whole MTTR article in HBase
book doesn't work at all with this parameter set to it's default value (15
minutes).

So I'm still struggling to figure out what's drawback exactly on this?

On Mon, Mar 23, 2015 at 2:50 PM Nicolas Liochon <nkeywal@gmail.com> wrote:

> Thanks for the explanation. There is an issue if you modify this setting
> however.
> hbase tries to recover the lease (i.e. be sure that nobody is writing)
> If you change hbase.lease.recovery.timeout hbase will start the recovery
> (i.e. start to read) even if it's not sure that nobody's writing. That
> means there is a dataloss risk.
> Basically, you must not see this warning: WARN
> org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying
> for[]
>
> The recoverLease must succeed. The fact that it does not after X tries is
> strange.
> There may be a mistmatch between the hbase parameters and the hdfs ones.
> You may need to have a look at the comments in FSHDFSUtils.java
>
>
>
>
> On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <dejan.menges@gmail.com>
> wrote:
>
> > I found the issue and fixed it, and will try to explain here what was
> > exactly in our case in case someone else finds this interesting too.
> >
> > So initially, we had (couple of times) some network and hardware issues
> in
> > our datacenters. When one server would die (literary die, we had some
> issue
> > with PSUs) we saw issues with overall cluster performance on HBase side.
> As
> > cluster is quite big and live, it was also quite hard to figure out exact
> > root cause and how to fix the stuff we wanted to fix.
> >
> > So I set up another cluster, four nodes (with DataNode and RegionServer)
> > and two other nodes with HMaster and Namenode in HA, using same stuff we
> > use on production. We pumped some data into it, and I was able to
> reproduce
> > same issue last week on it. What I tried to do is to cut one server (shut
> > down it's interface) when all is good with cluster, when we have load,
> and
> > see what's going to happen.
> >
> > On Friday, after Nicolas mentioned, I started taking a look in HBase logs
> > on the node which was mentioned in HMaster log as the one taking over
> > regions for the dead server. Basically what I was able to observe was 15
> > minutes time (+- couple of seconds, what was also interesting, and will
> got
> > later to that) between HMaster figures out that one of it's RegionServers
> > is dead, and the time one of the mentioned nodes starts taking over those
> > regions and they start appearing in HMaster's Web UI.
> >
> > I then set up everything like mentioned here
> > http://hbase.apache.org/book.html#mttr - but still had exactly the same
> > issues. Went over (again and again) all currently configured stuff, but
> > still had the same issue.
> >
> > Then I started looking into HDFS. Opened NameNode UI, saw all is good,
> took
> > one node down, was also looking RegionServer logs in the same time - and
> I
> > also see that it took ~15 minutes for Namenode to elect dead node as
> dead.
> > Somehow in the same moment regions started getting back to life. I
> remember
> > in some older versions dfs timeout checks and check retries. Looked into
> > defaults for our Hadoop version -
> >
> > http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/
> hadoop-hdfs/hdfs-default.xml
> > - and saw there that there's no recheck parameter anymore. Strange, as on
> > StackOverflow I found post from month ago, for newer version than we use
> > (we use 2.4.1, guy was using 2.6 - dfs.namenode.heartbeat.
> recheck-interval)
> > I set it to 10 seconds as he mentioned, having checks every three seconds
> > (default) and got DataNode elected as dead in ~45 seconds, as he
> mentioned.
> > Not sure why this parameter is not documented, but obviously it works.
> >
> > Then figured out it still didn't fix our HBase failover issue. I was
> > looking into HBase book again and again, and then saw this part:
> >
> > "How much time we allow elapse between calls to recover lease. Should be
> > larger than the dfs timeout."
> >
> > This was description for hbase.lease.recovery.dfs.timeout. Wasn't sure
> from
> > the comment what of all timeouts that's possible to set in Hadoop/HBase
> and
> > that has something to do with DFS is this all about. But picked
> > hbase.lease.recovery.timeout, and that was the catch.
> >
> > Initially, by default, hbase.lease.recovery.timeout is set to 15 minutes.
> > Not sure why, and wasn't able to find yet why, but getting this down to
> one
> > minute (what's more than OK for us) I was able to get rid of our issue.
> Not
> > also sure why this is not mentioned in MTTR section in HBase book, as
> > obviously MTTR doesn't work at all with this default timeout, at least it
> > doesn't work the way we expected it to work.
> >
> > So thanks again for everyone being spammed with this, and specially
> thanks
> > to Nicolas pointing me to the right direction.
> >
> > On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <nkeywal@gmail.com>
> wrote:
> >
> > > the attachments are rejected by the mailing list, can you put then on
> > > pastebin?
> > >
> > > stale is mandatory (so it's good), but the issue here is just before.
> The
> > > region server needs to read the file. In order to be sure that there is
> > no
> > > data loss, it needs to "recover the lease", that means ensuring that
> > nobody
> > > is writing the file. The regionserver calls the namenode to do this
> > > recoverLease. So there should be some info in the namenode logs. You
> have
> > > HDFS-4721 on your hdfs? The hbase book details (more or less...) this
> > > recoverLease stuff.
> > >
> > >
> > > On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <dejan.menges@gmail.com
> >
> > > wrote:
> > >
> > > > And also, just checked - dfs.namenode.avoid.read.stale.datanode and
> > > > dfs.namenode.avoid.write.stale.datanode
> > > > are both true, and dfs.namenode.stale.datanode.interval is set to
> > > default
> > > > 30000.
> > > >
> > > > On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <
> dejan.menges@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Nicolas,
> > > > >
> > > > > Please find log attached.
> > > > >
> > > > > As I see it now more clearly, it was trying to recover HDFS WALs
> from
> > > > node
> > > > > that's dead:
> > > > >
> > > > > 2015-03-23 08:53:44,381 WARN
> > org.apache.hadoop.hbase.util.FSHDFSUtils:
> > > > > Cannot recoverLease after trying for 900000ms
> > > > > (hbase.lease.recovery.timeout); continuing, but may be
> DATALOSS!!!;
> > > > > attempt=40 on
> > > > >
> > > > file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intent
> > > ionally_get_down_by_getting_network_down},60020,1426862900
> > > 506-splitting/{node_i_intentionally_get_down_by_gett
> > > ing_network_down}%2C60020%2C1426862900506.1427096924508
> > > > > after 908210ms
> > > > >
> > > > > And as you can see from the log, it tried 40 times, what took it
> > > exactly
> > > > > 15 minutes.
> > > > >
> > > > > There's probably some parameter to tune to cut it of from 40 times
> /
> > 15
> > > > > minutes to something more useful, as for 15 minutes we don't have
> our
> > > > > regions available, and HDFS have however replication factor 3.
> > > > >
> > > > > Googling, if I figure out what's this I will post it here. Will
> also
> > > > > appreciate if someone knows how to cut this down.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Dejan
> > > > >
> > > > > On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <nkeywal@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> The split is done by the region servers (the master coordinates).
> Is
> > > > there
> > > > >> some interesting stuff in their logs?
> > > > >>
> > > > >> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <
> > dejan.menges@gmail.com
> > > >
> > > > >> wrote:
> > > > >>
> > > > >> > With client issue was that it was retrying connecting to
the
> same
> > > > region
> > > > >> > servers even when they were reassigned. Lowering it down
helped
> in
> > > > this
> > > > >> > specific use case, but yes, we still want servers to reallocate
> > > > quickly.
> > > > >> >
> > > > >> > What got me here:
> > > > >> >
> > > > >> > http://hbase.apache.org/book.html#mttr
> > > > >> >
> > > > >> > I basically set configuration exactly the same way as it's
> > explained
> > > > >> here.
> > > > >> > *zookeeper.session.timeout* is (and was before) 60000 (one
> > minute).
> > > > >> >
> > > > >> > So basically what happens is: - simulating network issues
we had
> > > > >> recently).
> > > > >> > - After short time I see in HBase that my RegionServer is
dead,
> > and
> > > as
> > > > >> > total number of regions I see previous total minus number
of
> > regions
> > > > >> that
> > > > >> > were hosted on the node hosting RegionServer that just
> > > 'disappeared'.
> > > > >> > - In this point I want my clus
> > > > >> >
> > > > >> > - I have test cluster consisting of four nodes, every node
being
> > > > >> DataNode
> > > > >> > and RegionServer.
> > > > >> > - I simulate network partition on one (connect to it through
> > console
> > > > and
> > > > >> > take network interface downter to recover as soon as possible,
> to
> > > > start
> > > > >> > serving missing regions.
> > > > >> > - First thing I see in HMaster logs are:
> > > > >> >
> > > > >> > 2015-03-20 14:17:26,015 INFO
> > > > >> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker:
> > RegionServer
> > > > >> > ephemeral node deleted, processing expiration
> > > > >> > [{name_of_node_I_took_down},60020,1426860403261]
> > > > >> >
> > > > >> > 2015-03-20 14:17:26,067 INFO
> > > > >> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> > > > Splitting
> > > > >> > logs for {name_of_node_I_took_down},60020,1426860403261
before
> > > > >> assignment.
> > > > >> >
> > > > >> > 2015-03-20 14:17:26,105 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog
> > > workers
> > > > [
> > > > >> > {name_of_node_I_took_down},60020,1426860403261]
> > > > >> >
> > > > >> > 2015-03-20 14:17:26,107 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: started
> splitting
> > 1
> > > > >> logs in
> > > > >> > [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_
> > > took_down}
> > > > >> > ,60020,1426860403261-splitting]
> > > > >> >
> > > > >> > 2015-03-20 14:17:26,150 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > > > >> > /hbase/splitWAL/WALs%2F
> > > > >> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268
> > > 60404905
> > > > >> > acquired by {fqdn_of_another_live_hnode},60020,1426859445623
> > > > >> >
> > > > >> > 2015-03-20 14:17:26,182 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
= 1
> > > > >> unassigned
> > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > >> >
> > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > in_progress
> > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done
= 0
> > error =
> > > > 0}
> > > > >> >
> > > > >> > 2015-03-20 14:17:31,183 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
= 1
> > > > >> unassigned
> > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > >> >
> > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > in_progress
> > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done
= 0
> > error =
> > > > 0}
> > > > >> >
> > > > >> > 2015-03-20 14:17:36,184 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
= 1
> > > > >> unassigned
> > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > >> >
> > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > in_progress
> > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done
= 0
> > error =
> > > > 0}
> > > > >> >
> > > > >> > 2015-03-20 14:17:42,185 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
= 1
> > > > >> unassigned
> > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > >> >
> > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > in_progress
> > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done
= 0
> > error =
> > > > 0}
> > > > >> >
> > > > >> > 2015-03-20 14:17:48,184 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks
= 1
> > > > >> unassigned
> > > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > >> >
> > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > > in_progress
> > > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done
= 0
> > error =
> > > > 0}
> > > > >> > In the meantime, In hbase...out log I got this:
> > > > >> >
> > > > >> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out
<==
> > > > >> >
> > > > >> > java.io.IOException: Call to
> > > > >> > {name_of_node_I_took_down}/{ip_of_local_interface_I_took_dow
> > > n}:60020
> > > > >> > failed on local exception:
> > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException:
> Call
> > > > >> > id=93152, waitTime=60044, rpcTimeout=60000
> > > > >> >
> > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClien
> > > > >> t.java:1532)
> > > > >> >
> > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:
> 1502)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(Rpc
> > > > >> Client.java:1684)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImpl
> > > ementati
> > > > >> on.callBlockingMethod(RpcClient.java:1737)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$
> > > AdminService$
> > > > >> BlockingStub.getRegionInfo(AdminProtos.java:20806)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState
> > > > >> (HBaseAdmin.java:2524)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.hbase.generated.master.table_jsp._jspServi
> > > > >> ce(table_jsp.java:167)
> > > > >> >
> > > > >> > at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.
> > > java:98)
> > > > >> >
> > > > >> > at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
> > > > >> >
> > > > >> > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder
> > > > >> .java:511)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > >> r(ServletHandler.java:1221)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFil
> > > > >> ter.doFilter(StaticUserWebFilter.java:109)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > >> r(ServletHandler.java:1212)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilte
> > > > >> r(HttpServer.java:1081)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > >> r(ServletHandler.java:1212)
> > > > >> >
> > > > >> > at
> > > > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > > >> r(ServletHandler.java:1212)
> > > > >> >
> > > > >> > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandl
> > > > >> er.java:399)
> > > > >> >
> > > > >> > at
> > > > >> > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHa
> > > > >> ndler.java:216)
> > > > >> >
> > > > >> > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandl
> > > > >> er.java:182)
> > > > >> >
> > > > >> > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandl
> > > > >> er.java:766)
> > > > >> >
> > > > >> > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.
> > > > >> java:450)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.jetty.handler.ContextHandlerCollection.handle(Co
> > > > >> ntextHandlerCollection.java:230)
> > > > >> >
> > > > >> > at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapp
> > > > >> er.java:152)
> > > > >> >
> > > > >> > at org.mortbay.jetty.Server.handle(Server.java:326)
> > > > >> >
> > > > >> > at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnectio
> > > > >> n.java:542)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.jetty.HttpConnection$RequestHandler.headerComple
> > > > >> te(HttpConnection.java:928)
> > > > >> >
> > > > >> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > > >> >
> > > > >> > at
> > org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > > >> >
> > > > >> > at
> > org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEn
> > > > >> dPoint.java:410)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.mortbay.thread.QueuedThreadPool$PoolThread.run(
> > > > >> QueuedThreadPool.java:582)
> > > > >> >
> > > > >> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$
> > > CallTimeoutException:
> > > > >> Call
> > > > >> > id=93152, waitTime=60044, rpcTimeout=60000
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCall
> > > > >> s(RpcClient.java:1234)
> > > > >> >
> > > > >> > at
> > > > >> >
> > > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.readRespons
> > > > >> e(RpcClient.java:1171)
> > > > >> >
> > > > >> > at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClie
> > > > >> nt.java:751)
> > > > >> > Beside this same issue, please note that first message was
at
> > > > 2015-03-20
> > > > >> > 14:17:26,015. And then (we got to the point when it started
> > > > transition):
> > > > >> >
> > > > >> > 2015-03-20 14:32:35,059 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > > > >> > /hbase/splitWAL/WALs%2F
> > > > >> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268
> > > 60404905
> > > > >> > entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
> > > > >> >
> > > > >> > 2015-03-20 14:32:35,109 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: Done splitting
> > > > >> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}
> > > > >> > %252C60020%252C1426860403261.1426860404905
> > > > >> >
> > > > >> > 2015-03-20 14:32:35,190 INFO
> > > > >> > org.apache.hadoop.hbase.master.SplitLogManager: finished
> splitting
> > > > >> (more
> > > > >> > than or equal to) 9 bytes in 1 log files in
> > > > >> >
> > > > >> > [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_
> > > > >> node_I_took_down},60020,1426860403261-splitting]
> > > > >> > in 909083ms
> > > > >> >
> > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > >> .RegionStates:
> > > > >> > Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN,
> > > > >> > ts=1426860639088,
> > > > server={name_of_node_I_took_down},60020,1426860403261}
> > > > >> to
> > > > >> > {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE,
> ts=1426861955191,
> > > > >> server=
> > > > >> > {name_of_node_I_took_down},60020,1426860403261}
> > > > >> >
> > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > >> .RegionStates:
> > > > >> > Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from
> > > > >> {name_of_node_I_took_down}
> > > > >> > ,60020,1426860403261
> > > > >> >
> > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > >> .RegionStates:
> > > > >> > Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN,
> > > > >> > ts=1426860641783,
> > > > server={name_of_node_I_took_down},60020,1426860403261}
> > > > >> to
> > > > >> > {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE,
> ts=1426861955191,
> > > > >> server=
> > > > >> > {name_of_node_I_took_down},60020,1426860403261}
> > > > >> >
> > > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > > >> .RegionStates:
> > > > >> > Offlined 25ab6e7b42e36ddaa723d71be5954543 from
> > > > >> {name_of_node_I_took_down}
> > > > >> > ,60020,1426860403261
> > > > >> > At this point, note that it finished SplitLogManager task
at
> > > 14:32:35
> > > > >> and
> > > > >> > started transitioning just after that. So this is 15 minutes
> that
> > > I'm
> > > > >> > talking about.
> > > > >> >
> > > > >> > What am I missing?
> > > > >> >
> > > > >> >
> > > > >> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <
> > nkeywal@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > > You've changed the value of hbase.zookeeper.timeout
to 15
> > > minutes? A
> > > > >> very
> > > > >> > > reasonable target is 1 minute before relocating the
regions.
> > > That's
> > > > >> the
> > > > >> > > default iirc. You can push it to 20s, but then
> > > gc-stopping-the-world
> > > > >> > > becomes more of an issue. 15 minutes is really a lot.
The hdfs
> > > stale
> > > > >> mode
> > > > >> > > must always be used, with a lower timeout than the
hbase one.
> > > > >> > >
> > > > >> > > Client side there should be nothing to do (excepted
for
> advanced
> > > > >> stuff);
> > > > >> > at
> > > > >> > > each retry the client checks the location of the regions.
If
> you
> > > > lower
> > > > >> > the
> > > > >> > > number of retry the client will fail sooner, but usually
you
> > don't
> > > > >> want
> > > > >> > the
> > > > >> > > client to fail, you want the servers to reallocate
quickly.
> > > > >> > >
> > > > >> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <
> > > > dejan.menges@gmail.com
> > > > >> >
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi,
> > > > >> > > >
> > > > >> > > > Sorry for little bit late update, but managed
to narrow it
> > > little
> > > > >> bit
> > > > >> > > down.
> > > > >> > > >
> > > > >> > > > We didn't update yet, as we are using Hortonworks
> distribution
> > > > right
> > > > >> > now,
> > > > >> > > > and even if we update we will get 0.98.4. However,
looks
> that
> > > > issue
> > > > >> > here
> > > > >> > > > was in our use case and configuration (still looking
into
> it).
> > > > >> > > >
> > > > >> > > > Basically, initially I saw that when one server
goes down,
> we
> > > > start
> > > > >> > > having
> > > > >> > > > performance issues in general, but it managed
to be on our
> > > client
> > > > >> side,
> > > > >> > > due
> > > > >> > > > to caching, and clients were trying to reconnect
to nodes
> that
> > > > were
> > > > >> > > offline
> > > > >> > > > and later trying to get regions from those nodes
too. This
> is
> > > > >> basically
> > > > >> > > why
> > > > >> > > > on server side I didn't manage to see anything
in logs that
> > > would
> > > > >> be at
> > > > >> > > > least little bit interesting and point me into
desired
> > > direction.
> > > > >> > > >
> > > > >> > > > Another question that popped up to me is - in
case server is
> > > down
> > > > >> (and
> > > > >> > > with
> > > > >> > > > it DataNode and HRegionServer it was hosting)
- what's
> optimal
> > > > time
> > > > >> to
> > > > >> > > set
> > > > >> > > > for HMaster to consider server dead reassign regions
> somewhere
> > > > >> else, as
> > > > >> > > > this is another performance bottleneck we hit
during
> inability
> > > to
> > > > >> > access
> > > > >> > > > regions? In our case it's configured to 15 minutes,
and
> simple
> > > > logic
> > > > >> > > tells
> > > > >> > > > me if you want it earlier then configure lower
number of
> > > retries,
> > > > >> but
> > > > >> > > issue
> > > > >> > > > is as always in details, so not sure if anyone
knows some
> > better
> > > > >> math
> > > > >> > for
> > > > >> > > > this?
> > > > >> > > >
> > > > >> > > > And last question - is it possible to manually
force HBase
> to
> > > > >> reassign
> > > > >> > > > regions? In this case, while HMaster is retrying
to contact
> > node
> > > > >> that's
> > > > >> > > > dead, it's impossible to force it using 'balancer'
command.
> > > > >> > > >
> > > > >> > > > Thanks a lot!
> > > > >> > > >
> > > > >> > > > Dejan
> > > > >> > > >
> > > > >> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <
> > > > >> dejan.menges@gmail.com>
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi,
> > > > >> > > > >
> > > > >> > > > > To be very honest - there's no particular
reason why we
> > stick
> > > to
> > > > >> this
> > > > >> > > > one,
> > > > >> > > > > beside just lack of time currently to go
through upgrade
> > > > process,
> > > > >> but
> > > > >> > > > looks
> > > > >> > > > > to me that's going to be next step.
> > > > >> > > > >
> > > > >> > > > > Had a crazy day, didn't have time to go through
all logs
> > > again,
> > > > >> plus
> > > > >> > > one
> > > > >> > > > > of the machines (last one where we had this
issue) is
> fully
> > > > >> > > reprovisioned
> > > > >> > > > > yesterday so I don't have logs from there
anymore.
> > > > >> > > > >
> > > > >> > > > > Beside upgrading,  what I will talk about
today, can you
> > just
> > > > >> point
> > > > >> > me
> > > > >> > > to
> > > > >> > > > > the specific RPC issue in 0.98.0? Thing is
that we have
> some
> > > > >> strange
> > > > >> > > > > moments with RPC in this case, and just want
to see if
> > that's
> > > > the
> > > > >> > same
> > > > >> > > > > thing (and we were even suspecting to RPC).
> > > > >> > > > >
> > > > >> > > > > Thanks a lot!
> > > > >> > > > > Dejan
> > > > >> > > > >
> > > > >> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell
<
> > > > >> apurtell@apache.org
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > >> Is there a particular reason why you
are using HBase
> > 0.98.0?
> > > > The
> > > > >> > > latest
> > > > >> > > > >> 0.98 release is 0.98.11. There's a known
performance
> issue
> > > with
> > > > >> > 0.98.0
> > > > >> > > > >> pertaining to RPC that was fixed in later
releases, you
> > > should
> > > > >> move
> > > > >> > up
> > > > >> > > > >> from
> > > > >> > > > >> 0.98.0. In addition hundreds of improvements
and bug
> fixes
> > > have
> > > > >> gone
> > > > >> > > > into
> > > > >> > > > >> the ten releases since 0.98.0.
> > > > >> > > > >>
> > > > >> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan
Menges <
> > > > >> > dejan.menges@gmail.com
> > > > >> > > >
> > > > >> > > > >> wrote:
> > > > >> > > > >>
> > > > >> > > > >> > Hi All,
> > > > >> > > > >> >
> > > > >> > > > >> > We have a strange issue with HBase
performance (overall
> > > > cluster
> > > > >> > > > >> > performance) in case one of datanodes
in the cluster
> > > > >> unexpectedly
> > > > >> > > goes
> > > > >> > > > >> > down.
> > > > >> > > > >> >
> > > > >> > > > >> > So scenario is like follows:
> > > > >> > > > >> > - Cluster works fine, it's stable.
> > > > >> > > > >> > - One DataNode unexpectedly goes
down (PSU issue,
> network
> > > > >> issue,
> > > > >> > > > >> anything)
> > > > >> > > > >> > - Whole HBase cluster goes down
(performance becomes so
> > bad
> > > > >> that
> > > > >> > we
> > > > >> > > > >> have to
> > > > >> > > > >> > restart all RegionServers to get
it back to life).
> > > > >> > > > >> >
> > > > >> > > > >> > Most funny and latest issue that
happened was that we
> > added
> > > > new
> > > > >> > node
> > > > >> > > > to
> > > > >> > > > >> the
> > > > >> > > > >> > cluster (having 8 x 4T SATA disks)
and we left just
> > > DataNode
> > > > >> > running
> > > > >> > > > on
> > > > >> > > > >> it
> > > > >> > > > >> > to give it couple of days to get
some data. At some
> point
> > > in
> > > > >> time,
> > > > >> > > due
> > > > >> > > > >> to
> > > > >> > > > >> > hardware issue, server rebooted
(twice during three
> > hours)
> > > in
> > > > >> > moment
> > > > >> > > > >> when
> > > > >> > > > >> > it had maybe 5% of data it would
have in a couple of
> > days.
> > > > >> Nothing
> > > > >> > > > else
> > > > >> > > > >> > beside DataNode was running, and
once it went down, it
> > > > affected
> > > > >> > > > literary
> > > > >> > > > >> > everything, and restarting RegionServers
in the end
> fixed
> > > it.
> > > > >> > > > >> >
> > > > >> > > > >> > We are using HBase 0.98.0 with Hadoop
2.4.0
> > > > >> > > > >> >
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >>
> > > > >> > > > >> --
> > > > >> > > > >> Best regards,
> > > > >> > > > >>
> > > > >> > > > >>    - Andy
> > > > >> > > > >>
> > > > >> > > > >> Problems worthy of attack prove their
worth by hitting
> > back.
> > > -
> > > > >> Piet
> > > > >> > > Hein
> > > > >> > > > >> (via Tom White)
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message