hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicolas Liochon <nkey...@gmail.com>
Subject Re: Strange issue when DataNode goes down
Date Mon, 23 Mar 2015 13:47:54 GMT
Thanks for the explanation. There is an issue if you modify this setting
however.
hbase tries to recover the lease (i.e. be sure that nobody is writing)
If you change hbase.lease.recovery.timeout hbase will start the recovery
(i.e. start to read) even if it's not sure that nobody's writing. That
means there is a dataloss risk.
Basically, you must not see this warning: WARN
org.apache.hadoop.hbase.util.FSHDFSUtils: Cannot recoverLease after trying
for[]

The recoverLease must succeed. The fact that it does not after X tries is
strange.
There may be a mistmatch between the hbase parameters and the hdfs ones.
You may need to have a look at the comments in FSHDFSUtils.java




On Mon, Mar 23, 2015 at 2:15 PM, Dejan Menges <dejan.menges@gmail.com>
wrote:

> I found the issue and fixed it, and will try to explain here what was
> exactly in our case in case someone else finds this interesting too.
>
> So initially, we had (couple of times) some network and hardware issues in
> our datacenters. When one server would die (literary die, we had some issue
> with PSUs) we saw issues with overall cluster performance on HBase side. As
> cluster is quite big and live, it was also quite hard to figure out exact
> root cause and how to fix the stuff we wanted to fix.
>
> So I set up another cluster, four nodes (with DataNode and RegionServer)
> and two other nodes with HMaster and Namenode in HA, using same stuff we
> use on production. We pumped some data into it, and I was able to reproduce
> same issue last week on it. What I tried to do is to cut one server (shut
> down it's interface) when all is good with cluster, when we have load, and
> see what's going to happen.
>
> On Friday, after Nicolas mentioned, I started taking a look in HBase logs
> on the node which was mentioned in HMaster log as the one taking over
> regions for the dead server. Basically what I was able to observe was 15
> minutes time (+- couple of seconds, what was also interesting, and will got
> later to that) between HMaster figures out that one of it's RegionServers
> is dead, and the time one of the mentioned nodes starts taking over those
> regions and they start appearing in HMaster's Web UI.
>
> I then set up everything like mentioned here
> http://hbase.apache.org/book.html#mttr - but still had exactly the same
> issues. Went over (again and again) all currently configured stuff, but
> still had the same issue.
>
> Then I started looking into HDFS. Opened NameNode UI, saw all is good, took
> one node down, was also looking RegionServer logs in the same time - and I
> also see that it took ~15 minutes for Namenode to elect dead node as dead.
> Somehow in the same moment regions started getting back to life. I remember
> in some older versions dfs timeout checks and check retries. Looked into
> defaults for our Hadoop version -
>
> http://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
> - and saw there that there's no recheck parameter anymore. Strange, as on
> StackOverflow I found post from month ago, for newer version than we use
> (we use 2.4.1, guy was using 2.6 - dfs.namenode.heartbeat.recheck-interval)
> I set it to 10 seconds as he mentioned, having checks every three seconds
> (default) and got DataNode elected as dead in ~45 seconds, as he mentioned.
> Not sure why this parameter is not documented, but obviously it works.
>
> Then figured out it still didn't fix our HBase failover issue. I was
> looking into HBase book again and again, and then saw this part:
>
> "How much time we allow elapse between calls to recover lease. Should be
> larger than the dfs timeout."
>
> This was description for hbase.lease.recovery.dfs.timeout. Wasn't sure from
> the comment what of all timeouts that's possible to set in Hadoop/HBase and
> that has something to do with DFS is this all about. But picked
> hbase.lease.recovery.timeout, and that was the catch.
>
> Initially, by default, hbase.lease.recovery.timeout is set to 15 minutes.
> Not sure why, and wasn't able to find yet why, but getting this down to one
> minute (what's more than OK for us) I was able to get rid of our issue. Not
> also sure why this is not mentioned in MTTR section in HBase book, as
> obviously MTTR doesn't work at all with this default timeout, at least it
> doesn't work the way we expected it to work.
>
> So thanks again for everyone being spammed with this, and specially thanks
> to Nicolas pointing me to the right direction.
>
> On Mon, Mar 23, 2015 at 1:37 PM Nicolas Liochon <nkeywal@gmail.com> wrote:
>
> > the attachments are rejected by the mailing list, can you put then on
> > pastebin?
> >
> > stale is mandatory (so it's good), but the issue here is just before. The
> > region server needs to read the file. In order to be sure that there is
> no
> > data loss, it needs to "recover the lease", that means ensuring that
> nobody
> > is writing the file. The regionserver calls the namenode to do this
> > recoverLease. So there should be some info in the namenode logs. You have
> > HDFS-4721 on your hdfs? The hbase book details (more or less...) this
> > recoverLease stuff.
> >
> >
> > On Mon, Mar 23, 2015 at 10:33 AM, Dejan Menges <dejan.menges@gmail.com>
> > wrote:
> >
> > > And also, just checked - dfs.namenode.avoid.read.stale.datanode and
> > > dfs.namenode.avoid.write.stale.datanode
> > > are both true, and dfs.namenode.stale.datanode.interval is set to
> > default
> > > 30000.
> > >
> > > On Mon, Mar 23, 2015 at 10:03 AM Dejan Menges <dejan.menges@gmail.com>
> > > wrote:
> > >
> > > > Hi Nicolas,
> > > >
> > > > Please find log attached.
> > > >
> > > > As I see it now more clearly, it was trying to recover HDFS WALs from
> > > node
> > > > that's dead:
> > > >
> > > > 2015-03-23 08:53:44,381 WARN
> org.apache.hadoop.hbase.util.FSHDFSUtils:
> > > > Cannot recoverLease after trying for 900000ms
> > > > (hbase.lease.recovery.timeout); continuing, but may be DATALOSS!!!;
> > > > attempt=40 on
> > > >
> > > file=hdfs://{my_hmaster_node}:8020/hbase/WALs/{node_i_intent
> > ionally_get_down_by_getting_network_down},60020,1426862900
> > 506-splitting/{node_i_intentionally_get_down_by_gett
> > ing_network_down}%2C60020%2C1426862900506.1427096924508
> > > > after 908210ms
> > > >
> > > > And as you can see from the log, it tried 40 times, what took it
> > exactly
> > > > 15 minutes.
> > > >
> > > > There's probably some parameter to tune to cut it of from 40 times /
> 15
> > > > minutes to something more useful, as for 15 minutes we don't have our
> > > > regions available, and HDFS have however replication factor 3.
> > > >
> > > > Googling, if I figure out what's this I will post it here. Will also
> > > > appreciate if someone knows how to cut this down.
> > > >
> > > > Thanks,
> > > >
> > > > Dejan
> > > >
> > > > On Fri, Mar 20, 2015 at 3:49 PM Nicolas Liochon <nkeywal@gmail.com>
> > > wrote:
> > > >
> > > >> The split is done by the region servers (the master coordinates).
Is
> > > there
> > > >> some interesting stuff in their logs?
> > > >>
> > > >> On Fri, Mar 20, 2015 at 3:38 PM, Dejan Menges <
> dejan.menges@gmail.com
> > >
> > > >> wrote:
> > > >>
> > > >> > With client issue was that it was retrying connecting to the
same
> > > region
> > > >> > servers even when they were reassigned. Lowering it down helped
in
> > > this
> > > >> > specific use case, but yes, we still want servers to reallocate
> > > quickly.
> > > >> >
> > > >> > What got me here:
> > > >> >
> > > >> > http://hbase.apache.org/book.html#mttr
> > > >> >
> > > >> > I basically set configuration exactly the same way as it's
> explained
> > > >> here.
> > > >> > *zookeeper.session.timeout* is (and was before) 60000 (one
> minute).
> > > >> >
> > > >> > So basically what happens is: - simulating network issues we
had
> > > >> recently).
> > > >> > - After short time I see in HBase that my RegionServer is dead,
> and
> > as
> > > >> > total number of regions I see previous total minus number of
> regions
> > > >> that
> > > >> > were hosted on the node hosting RegionServer that just
> > 'disappeared'.
> > > >> > - In this point I want my clus
> > > >> >
> > > >> > - I have test cluster consisting of four nodes, every node being
> > > >> DataNode
> > > >> > and RegionServer.
> > > >> > - I simulate network partition on one (connect to it through
> console
> > > and
> > > >> > take network interface downter to recover as soon as possible,
to
> > > start
> > > >> > serving missing regions.
> > > >> > - First thing I see in HMaster logs are:
> > > >> >
> > > >> > 2015-03-20 14:17:26,015 INFO
> > > >> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker:
> RegionServer
> > > >> > ephemeral node deleted, processing expiration
> > > >> > [{name_of_node_I_took_down},60020,1426860403261]
> > > >> >
> > > >> > 2015-03-20 14:17:26,067 INFO
> > > >> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> > > Splitting
> > > >> > logs for {name_of_node_I_took_down},60020,1426860403261 before
> > > >> assignment.
> > > >> >
> > > >> > 2015-03-20 14:17:26,105 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog
> > workers
> > > [
> > > >> > {name_of_node_I_took_down},60020,1426860403261]
> > > >> >
> > > >> > 2015-03-20 14:17:26,107 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: started splitting
> 1
> > > >> logs in
> > > >> > [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_
> > took_down}
> > > >> > ,60020,1426860403261-splitting]
> > > >> >
> > > >> > 2015-03-20 14:17:26,150 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > > >> > /hbase/splitWAL/WALs%2F
> > > >> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268
> > 60404905
> > > >> > acquired by {fqdn_of_another_live_hnode},60020,1426859445623
> > > >> >
> > > >> > 2015-03-20 14:17:26,182 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks =
1
> > > >> unassigned
> > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > >> >
> > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > in_progress
> > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> error =
> > > 0}
> > > >> >
> > > >> > 2015-03-20 14:17:31,183 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks =
1
> > > >> unassigned
> > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > >> >
> > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > in_progress
> > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> error =
> > > 0}
> > > >> >
> > > >> > 2015-03-20 14:17:36,184 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks =
1
> > > >> unassigned
> > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > >> >
> > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > in_progress
> > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> error =
> > > 0}
> > > >> >
> > > >> > 2015-03-20 14:17:42,185 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks =
1
> > > >> unassigned
> > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > >> >
> > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > in_progress
> > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> error =
> > > 0}
> > > >> >
> > > >> > 2015-03-20 14:17:48,184 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: total tasks =
1
> > > >> unassigned
> > > >> > = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > >> >
> > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_
> > > >> down}%252C60020%252C1426860403261.1426860404905=last_update
> > > >> > = 1426861046182 last_version = 2 cur_worker_name =
> > > >> > {fqdn_of_another_live_node},60020,1426859445623 status =
> > in_progress
> > > >> > incarnation = 0 resubmits = 0 batch = installed = 1 done = 0
> error =
> > > 0}
> > > >> > In the meantime, In hbase...out log I got this:
> > > >> >
> > > >> > ==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
> > > >> >
> > > >> > java.io.IOException: Call to
> > > >> > {name_of_node_I_took_down}/{ip_of_local_interface_I_took_dow
> > n}:60020
> > > >> > failed on local exception:
> > > >> > org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call
> > > >> > id=93152, waitTime=60044, rpcTimeout=60000
> > > >> >
> > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClien
> > > >> t.java:1532)
> > > >> >
> > > >> > at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(Rpc
> > > >> Client.java:1684)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImpl
> > ementati
> > > >> on.callBlockingMethod(RpcClient.java:1737)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.hbase.protobuf.generated.AdminProtos$
> > AdminService$
> > > >> BlockingStub.getRegionInfo(AdminProtos.java:20806)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState
> > > >> (HBaseAdmin.java:2524)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.hbase.generated.master.table_jsp._jspServi
> > > >> ce(table_jsp.java:167)
> > > >> >
> > > >> > at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.
> > java:98)
> > > >> >
> > > >> > at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
> > > >> >
> > > >> > at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder
> > > >> .java:511)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > >> r(ServletHandler.java:1221)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFil
> > > >> ter.doFilter(StaticUserWebFilter.java:109)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > >> r(ServletHandler.java:1212)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilte
> > > >> r(HttpServer.java:1081)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > >> r(ServletHandler.java:1212)
> > > >> >
> > > >> > at
> > > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilte
> > > >> r(ServletHandler.java:1212)
> > > >> >
> > > >> > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandl
> > > >> er.java:399)
> > > >> >
> > > >> > at
> > > >> > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHa
> > > >> ndler.java:216)
> > > >> >
> > > >> > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandl
> > > >> er.java:182)
> > > >> >
> > > >> > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandl
> > > >> er.java:766)
> > > >> >
> > > >> > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.
> > > >> java:450)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.jetty.handler.ContextHandlerCollection.handle(Co
> > > >> ntextHandlerCollection.java:230)
> > > >> >
> > > >> > at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapp
> > > >> er.java:152)
> > > >> >
> > > >> > at org.mortbay.jetty.Server.handle(Server.java:326)
> > > >> >
> > > >> > at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnectio
> > > >> n.java:542)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.jetty.HttpConnection$RequestHandler.headerComple
> > > >> te(HttpConnection.java:928)
> > > >> >
> > > >> > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
> > > >> >
> > > >> > at
> org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
> > > >> >
> > > >> > at
> org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEn
> > > >> dPoint.java:410)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.mortbay.thread.QueuedThreadPool$PoolThread.run(
> > > >> QueuedThreadPool.java:582)
> > > >> >
> > > >> > Caused by: org.apache.hadoop.hbase.ipc.RpcClient$
> > CallTimeoutException:
> > > >> Call
> > > >> > id=93152, waitTime=60044, rpcTimeout=60000
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCall
> > > >> s(RpcClient.java:1234)
> > > >> >
> > > >> > at
> > > >> >
> > > >> > org.apache.hadoop.hbase.ipc.RpcClient$Connection.readRespons
> > > >> e(RpcClient.java:1171)
> > > >> >
> > > >> > at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClie
> > > >> nt.java:751)
> > > >> > Beside this same issue, please note that first message was at
> > > 2015-03-20
> > > >> > 14:17:26,015. And then (we got to the point when it started
> > > transition):
> > > >> >
> > > >> > 2015-03-20 14:32:35,059 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: task
> > > >> > /hbase/splitWAL/WALs%2F
> > > >> > {name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F
> > > >> > {name_of_node_I_took_down}%252C60020%252C1426860403261.14268
> > 60404905
> > > >> > entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
> > > >> >
> > > >> > 2015-03-20 14:32:35,109 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: Done splitting
> > > >> > /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}
> > > >> > %2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}
> > > >> > %252C60020%252C1426860403261.1426860404905
> > > >> >
> > > >> > 2015-03-20 14:32:35,190 INFO
> > > >> > org.apache.hadoop.hbase.master.SplitLogManager: finished splitting
> > > >> (more
> > > >> > than or equal to) 9 bytes in 1 log files in
> > > >> >
> > > >> > [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_
> > > >> node_I_took_down},60020,1426860403261-splitting]
> > > >> > in 909083ms
> > > >> >
> > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > >> .RegionStates:
> > > >> > Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN,
> > > >> > ts=1426860639088,
> > > server={name_of_node_I_took_down},60020,1426860403261}
> > > >> to
> > > >> > {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191,
> > > >> server=
> > > >> > {name_of_node_I_took_down},60020,1426860403261}
> > > >> >
> > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > >> .RegionStates:
> > > >> > Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from
> > > >> {name_of_node_I_took_down}
> > > >> > ,60020,1426860403261
> > > >> >
> > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > >> .RegionStates:
> > > >> > Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN,
> > > >> > ts=1426860641783,
> > > server={name_of_node_I_took_down},60020,1426860403261}
> > > >> to
> > > >> > {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191,
> > > >> server=
> > > >> > {name_of_node_I_took_down},60020,1426860403261}
> > > >> >
> > > >> > 2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master
> > > >> .RegionStates:
> > > >> > Offlined 25ab6e7b42e36ddaa723d71be5954543 from
> > > >> {name_of_node_I_took_down}
> > > >> > ,60020,1426860403261
> > > >> > At this point, note that it finished SplitLogManager task at
> > 14:32:35
> > > >> and
> > > >> > started transitioning just after that. So this is 15 minutes
that
> > I'm
> > > >> > talking about.
> > > >> >
> > > >> > What am I missing?
> > > >> >
> > > >> >
> > > >> > On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon <
> nkeywal@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > > You've changed the value of hbase.zookeeper.timeout to 15
> > minutes? A
> > > >> very
> > > >> > > reasonable target is 1 minute before relocating the regions.
> > That's
> > > >> the
> > > >> > > default iirc. You can push it to 20s, but then
> > gc-stopping-the-world
> > > >> > > becomes more of an issue. 15 minutes is really a lot. The
hdfs
> > stale
> > > >> mode
> > > >> > > must always be used, with a lower timeout than the hbase
one.
> > > >> > >
> > > >> > > Client side there should be nothing to do (excepted for
advanced
> > > >> stuff);
> > > >> > at
> > > >> > > each retry the client checks the location of the regions.
If you
> > > lower
> > > >> > the
> > > >> > > number of retry the client will fail sooner, but usually
you
> don't
> > > >> want
> > > >> > the
> > > >> > > client to fail, you want the servers to reallocate quickly.
> > > >> > >
> > > >> > > On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges <
> > > dejan.menges@gmail.com
> > > >> >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Hi,
> > > >> > > >
> > > >> > > > Sorry for little bit late update, but managed to narrow
it
> > little
> > > >> bit
> > > >> > > down.
> > > >> > > >
> > > >> > > > We didn't update yet, as we are using Hortonworks distribution
> > > right
> > > >> > now,
> > > >> > > > and even if we update we will get 0.98.4. However,
looks that
> > > issue
> > > >> > here
> > > >> > > > was in our use case and configuration (still looking
into it).
> > > >> > > >
> > > >> > > > Basically, initially I saw that when one server goes
down, we
> > > start
> > > >> > > having
> > > >> > > > performance issues in general, but it managed to be
on our
> > client
> > > >> side,
> > > >> > > due
> > > >> > > > to caching, and clients were trying to reconnect to
nodes that
> > > were
> > > >> > > offline
> > > >> > > > and later trying to get regions from those nodes too.
This is
> > > >> basically
> > > >> > > why
> > > >> > > > on server side I didn't manage to see anything in logs
that
> > would
> > > >> be at
> > > >> > > > least little bit interesting and point me into desired
> > direction.
> > > >> > > >
> > > >> > > > Another question that popped up to me is - in case
server is
> > down
> > > >> (and
> > > >> > > with
> > > >> > > > it DataNode and HRegionServer it was hosting) - what's
optimal
> > > time
> > > >> to
> > > >> > > set
> > > >> > > > for HMaster to consider server dead reassign regions
somewhere
> > > >> else, as
> > > >> > > > this is another performance bottleneck we hit during
inability
> > to
> > > >> > access
> > > >> > > > regions? In our case it's configured to 15 minutes,
and simple
> > > logic
> > > >> > > tells
> > > >> > > > me if you want it earlier then configure lower number
of
> > retries,
> > > >> but
> > > >> > > issue
> > > >> > > > is as always in details, so not sure if anyone knows
some
> better
> > > >> math
> > > >> > for
> > > >> > > > this?
> > > >> > > >
> > > >> > > > And last question - is it possible to manually force
HBase to
> > > >> reassign
> > > >> > > > regions? In this case, while HMaster is retrying to
contact
> node
> > > >> that's
> > > >> > > > dead, it's impossible to force it using 'balancer'
command.
> > > >> > > >
> > > >> > > > Thanks a lot!
> > > >> > > >
> > > >> > > > Dejan
> > > >> > > >
> > > >> > > > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges <
> > > >> dejan.menges@gmail.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hi,
> > > >> > > > >
> > > >> > > > > To be very honest - there's no particular reason
why we
> stick
> > to
> > > >> this
> > > >> > > > one,
> > > >> > > > > beside just lack of time currently to go through
upgrade
> > > process,
> > > >> but
> > > >> > > > looks
> > > >> > > > > to me that's going to be next step.
> > > >> > > > >
> > > >> > > > > Had a crazy day, didn't have time to go through
all logs
> > again,
> > > >> plus
> > > >> > > one
> > > >> > > > > of the machines (last one where we had this issue)
is fully
> > > >> > > reprovisioned
> > > >> > > > > yesterday so I don't have logs from there anymore.
> > > >> > > > >
> > > >> > > > > Beside upgrading,  what I will talk about today,
can you
> just
> > > >> point
> > > >> > me
> > > >> > > to
> > > >> > > > > the specific RPC issue in 0.98.0? Thing is that
we have some
> > > >> strange
> > > >> > > > > moments with RPC in this case, and just want to
see if
> that's
> > > the
> > > >> > same
> > > >> > > > > thing (and we were even suspecting to RPC).
> > > >> > > > >
> > > >> > > > > Thanks a lot!
> > > >> > > > > Dejan
> > > >> > > > >
> > > >> > > > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell
<
> > > >> apurtell@apache.org
> > > >> > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > >> Is there a particular reason why you are using
HBase
> 0.98.0?
> > > The
> > > >> > > latest
> > > >> > > > >> 0.98 release is 0.98.11. There's a known performance
issue
> > with
> > > >> > 0.98.0
> > > >> > > > >> pertaining to RPC that was fixed in later
releases, you
> > should
> > > >> move
> > > >> > up
> > > >> > > > >> from
> > > >> > > > >> 0.98.0. In addition hundreds of improvements
and bug fixes
> > have
> > > >> gone
> > > >> > > > into
> > > >> > > > >> the ten releases since 0.98.0.
> > > >> > > > >>
> > > >> > > > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges
<
> > > >> > dejan.menges@gmail.com
> > > >> > > >
> > > >> > > > >> wrote:
> > > >> > > > >>
> > > >> > > > >> > Hi All,
> > > >> > > > >> >
> > > >> > > > >> > We have a strange issue with HBase performance
(overall
> > > cluster
> > > >> > > > >> > performance) in case one of datanodes
in the cluster
> > > >> unexpectedly
> > > >> > > goes
> > > >> > > > >> > down.
> > > >> > > > >> >
> > > >> > > > >> > So scenario is like follows:
> > > >> > > > >> > - Cluster works fine, it's stable.
> > > >> > > > >> > - One DataNode unexpectedly goes down
(PSU issue, network
> > > >> issue,
> > > >> > > > >> anything)
> > > >> > > > >> > - Whole HBase cluster goes down (performance
becomes so
> bad
> > > >> that
> > > >> > we
> > > >> > > > >> have to
> > > >> > > > >> > restart all RegionServers to get it back
to life).
> > > >> > > > >> >
> > > >> > > > >> > Most funny and latest issue that happened
was that we
> added
> > > new
> > > >> > node
> > > >> > > > to
> > > >> > > > >> the
> > > >> > > > >> > cluster (having 8 x 4T SATA disks) and
we left just
> > DataNode
> > > >> > running
> > > >> > > > on
> > > >> > > > >> it
> > > >> > > > >> > to give it couple of days to get some
data. At some point
> > in
> > > >> time,
> > > >> > > due
> > > >> > > > >> to
> > > >> > > > >> > hardware issue, server rebooted (twice
during three
> hours)
> > in
> > > >> > moment
> > > >> > > > >> when
> > > >> > > > >> > it had maybe 5% of data it would have
in a couple of
> days.
> > > >> Nothing
> > > >> > > > else
> > > >> > > > >> > beside DataNode was running, and once
it went down, it
> > > affected
> > > >> > > > literary
> > > >> > > > >> > everything, and restarting RegionServers
in the end fixed
> > it.
> > > >> > > > >> >
> > > >> > > > >> > We are using HBase 0.98.0 with Hadoop
2.4.0
> > > >> > > > >> >
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >>
> > > >> > > > >> --
> > > >> > > > >> Best regards,
> > > >> > > > >>
> > > >> > > > >>    - Andy
> > > >> > > > >>
> > > >> > > > >> Problems worthy of attack prove their worth
by hitting
> back.
> > -
> > > >> Piet
> > > >> > > Hein
> > > >> > > > >> (via Tom White)
> > > >> > > > >>
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message