From: Dejan Menges <dejan.menges@gmail.com>
Date: Fri, 20 Mar 2015 14:38:48 +0000
Subject: Re: Strange issue when DataNode goes down
To: user@hbase.apache.org

With the client, the issue was that it kept retrying connections to the same region servers
even when the regions had already been reassigned. Lowering it helped in this specific use
case, but yes, we still want the servers to reallocate regions quickly.

What got me here: http://hbase.apache.org/book.html#mttr

I basically set the configuration exactly the same way as it's explained there (a sketch of
the relevant setting follows the scenario below). zookeeper.session.timeout is (and was
before) 60000 (one minute).

So basically what happens is:

- I have a test cluster of four nodes, every node being both a DataNode and a RegionServer.
- I simulate a network partition on one of them (connect to it through the console and take
  the network interface down, simulating the network issues we had recently).
- After a short time I see in HBase that my RegionServer is dead, and the total number of
  regions drops by the number of regions hosted on the node whose RegionServer just
  'disappeared'.
- At this point I want my cluster to recover as soon as possible and start serving the
  missing regions.
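For reference, this is roughly what the relevant part of our hbase-site.xml looks like for
the value quoted above (just a sketch; the other MTTR-related settings from the book section
are not reproduced here):

  <!-- hbase-site.xml (sketch) -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value> <!-- one minute, the value mentioned above -->
  </property>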
The first thing I see in the HMaster logs is this:

2015-03-20 14:17:26,015 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [{name_of_node_I_took_down},60020,1426860403261]
2015-03-20 14:17:26,067 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for {name_of_node_I_took_down},60020,1426860403261 before assignment.
2015-03-20 14:17:26,105 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [{name_of_node_I_took_down},60020,1426860403261]
2015-03-20 14:17:26,107 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 1 logs in [hdfs://{fqdn_of_hmaster}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting]
2015-03-20 14:17:26,150 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 acquired by {fqdn_of_another_live_node},60020,1426859445623
2015-03-20 14:17:26,182 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:31,183 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:36,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:42,185 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
2015-03-20 14:17:48,184 INFO org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1 unassigned = 0 tasks={/hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905=last_update = 1426861046182 last_version = 2 cur_worker_name = {fqdn_of_another_live_node},60020,1426859445623 status = in_progress incarnation = 0 resubmits = 0 batch = installed = 1 done = 0 error = 0}
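Since Nicolas notes below that the HDFS stale mode must always be used, with a lower timeout
than the HBase one, for completeness these are the hdfs-site.xml properties involved (a
sketch with illustrative values; I'm not claiming these are exactly what our cluster runs):

  <!-- hdfs-site.xml (sketch): mark a DataNode stale after 30s of missed heartbeats and
       steer reads/writes away from it, so recovery work is not stuck waiting on the dead node -->
  <property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>30000</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.read.stale.datanode</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.namenode.avoid.write.stale.datanode</name>
    <value>true</value>
  </property>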
In the meantime, in the hbase .out log I got this:

==> hbase-hbase-master-{fqdn_of_my_hmaster_node}.out <==
java.io.IOException: Call to {name_of_node_I_took_down}/{ip_of_local_interface_I_took_down}:60020 failed on local exception: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient.wrapException(RpcClient.java:1532)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1502)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1684)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1737)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getRegionInfo(AdminProtos.java:20806)
        at org.apache.hadoop.hbase.client.HBaseAdmin.getCompactionState(HBaseAdmin.java:2524)
        at org.apache.hadoop.hbase.generated.master.table_jsp._jspService(table_jsp.java:167)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:770)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
        at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:109)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:1081)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$CallTimeoutException: Call id=93152, waitTime=60044, rpcTimeout=60000
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.cleanupCalls(RpcClient.java:1234)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1171)
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:751)

Besides this same issue, please note that the first message was at 2015-03-20 14:17:26,015.
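The waitTime=60044 / rpcTimeout=60000 in that trace lines up with the default
hbase.rpc.timeout of 60000 ms. For the client-side tuning mentioned at the top (fewer
retries so the client gives up on a dead RegionServer and re-looks-up region locations
sooner), a minimal sketch of how those knobs can be overridden programmatically with the
0.98-era client API; the table name, row key, and values here are made up, not our real
settings:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HConnection;
  import org.apache.hadoop.hbase.client.HConnectionManager;
  import org.apache.hadoop.hbase.client.HTableInterface;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ClientTimeoutSketch {
      public static void main(String[] args) throws IOException {
          // Sketch only: client-side overrides; values are examples, not recommendations
          Configuration conf = HBaseConfiguration.create();
          conf.setInt("hbase.rpc.timeout", 30000);        // per-RPC timeout, default 60000 ms
          conf.setInt("hbase.client.retries.number", 3);  // fewer retries -> client fails and re-checks region location sooner

          HConnection connection = HConnectionManager.createConnection(conf);
          HTableInterface table = connection.getTable("my_table");   // hypothetical table name
          try {
              table.get(new Get(Bytes.toBytes("some_row")));          // hypothetical row key
          } finally {
              table.close();
              connection.close();
          }
      }
  }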
And then (we got to the point where it started the transition):

2015-03-20 14:32:35,059 INFO org.apache.hadoop.hbase.master.SplitLogManager: task /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905 entered state: DONE {fqdn_of_new_live_node},60020,1426859445623
2015-03-20 14:32:35,109 INFO org.apache.hadoop.hbase.master.SplitLogManager: Done splitting /hbase/splitWAL/WALs%2F{name_of_node_I_took_down}%2C60020%2C1426860403261-splitting%2F{name_of_node_I_took_down}%252C60020%252C1426860403261.1426860404905
2015-03-20 14:32:35,190 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 9 bytes in 1 log files in [hdfs://{fqdn_of_my_hmaster_node}:8020/hbase/WALs/{name_of_node_I_took_down},60020,1426860403261-splitting] in 909083ms
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {0e7cc87a4ef6c47a779186f5bf79a01c state=OPEN, ts=1426860639088, server={name_of_node_I_took_down},60020,1426860403261} to {0e7cc87a4ef6c47a779186f5bf79a01c state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 0e7cc87a4ef6c47a779186f5bf79a01c from {name_of_node_I_took_down},60020,1426860403261
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Transitioned {25ab6e7b42e36ddaa723d71be5954543 state=OPEN, ts=1426860641783, server={name_of_node_I_took_down},60020,1426860403261} to {25ab6e7b42e36ddaa723d71be5954543 state=OFFLINE, ts=1426861955191, server={name_of_node_I_took_down},60020,1426860403261}
2015-03-20 14:32:35,191 INFO org.apache.hadoop.hbase.master.RegionStates: Offlined 25ab6e7b42e36ddaa723d71be5954543 from {name_of_node_I_took_down},60020,1426860403261

At this point, note that it finished the SplitLogManager task at 14:32:35 and started
transitioning regions just after that. So these are the 15 minutes I'm talking about. What
am I missing?

On Fri, Mar 20, 2015 at 2:37 PM Nicolas Liochon wrote:

> You've changed the value of hbase.zookeeper.timeout to 15 minutes? A very
> reasonable target is 1 minute before relocating the regions. That's the
> default iirc. You can push it to 20s, but then gc-stopping-the-world
> becomes more of an issue. 15 minutes is really a lot. The hdfs stale mode
> must always be used, with a lower timeout than the hbase one.
>
> Client side there should be nothing to do (except for advanced stuff); at
> each retry the client checks the location of the regions. If you lower the
> number of retries the client will fail sooner, but usually you don't want
> the client to fail, you want the servers to reallocate quickly.
>
> On Fri, Mar 20, 2015 at 1:36 PM, Dejan Menges wrote:
>
> > Hi,
> >
> > Sorry for the slightly late update, but I managed to narrow it down a bit.
> >
> > We didn't upgrade yet, as we are using the Hortonworks distribution right
> > now, and even if we upgrade we will get 0.98.4. However, it looks like the
> > issue here was in our use case and configuration (still looking into it).
> >
> > Basically, initially I saw that when one server goes down we start having
> > performance issues in general, but it turned out to be on our client side,
> > due to caching: clients were trying to reconnect to nodes that were offline
> > and later trying to get regions from those nodes too.
> > This is basically why, on the server side, I didn't manage to see anything
> > in the logs that was even slightly interesting or pointed me in the desired
> > direction.
> >
> > Another question that popped up for me: in case a server is down (and with
> > it the DataNode and the HRegionServer it was hosting), what is the optimal
> > time to set for the HMaster to consider the server dead and reassign its
> > regions somewhere else? This is another performance bottleneck we hit while
> > being unable to access those regions. In our case it's configured to 15
> > minutes, and simple logic tells me that if you want it sooner you configure
> > a lower number of retries, but the issue is, as always, in the details, so
> > I'm not sure if anyone knows better math for this?
> >
> > And the last question - is it possible to manually force HBase to reassign
> > regions? In this case, while the HMaster is retrying to contact the node
> > that's dead, it's impossible to force it using the 'balancer' command.
> >
> > Thanks a lot!
> >
> > Dejan
> >
> > On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges wrote:
> >
> > > Hi,
> > >
> > > To be very honest - there's no particular reason why we stick to this
> > > one, besides a lack of time at the moment to go through the upgrade
> > > process, but it looks to me like that's going to be the next step.
> > >
> > > Had a crazy day, didn't have time to go through all the logs again, plus
> > > one of the machines (the last one where we had this issue) was fully
> > > reprovisioned yesterday, so I don't have logs from there anymore.
> > >
> > > Besides the upgrade, which I will discuss today, can you just point me to
> > > the specific RPC issue in 0.98.0? The thing is that we have some strange
> > > moments with RPC in this case, and I just want to see if it's the same
> > > thing (we were even suspecting RPC).
> > >
> > > Thanks a lot!
> > > Dejan
> > >
> > > On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell wrote:
> > >
> > >> Is there a particular reason why you are using HBase 0.98.0? The latest
> > >> 0.98 release is 0.98.11. There's a known performance issue with 0.98.0
> > >> pertaining to RPC that was fixed in later releases; you should move up
> > >> from 0.98.0. In addition, hundreds of improvements and bug fixes have
> > >> gone into the ten releases since 0.98.0.
> > >>
> > >> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges wrote:
> > >>
> > >> > Hi All,
> > >> >
> > >> > We have a strange issue with HBase performance (overall cluster
> > >> > performance) in case one of the datanodes in the cluster unexpectedly
> > >> > goes down.
> > >> >
> > >> > The scenario is as follows:
> > >> > - The cluster works fine, it's stable.
> > >> > - One DataNode unexpectedly goes down (PSU issue, network issue, anything).
> > >> > - The whole HBase cluster goes down (performance becomes so bad that we
> > >> > have to restart all RegionServers to get it back to life).
> > >> >
> > >> > The funniest and most recent occurrence was when we added a new node to
> > >> > the cluster (with 8 x 4T SATA disks) and left just the DataNode running
> > >> > on it to give it a couple of days to collect some data. At some point,
> > >> > due to a hardware issue, the server rebooted (twice during three hours)
> > >> > at a moment when it had maybe 5% of the data it would have had in a
> > >> > couple of days. Nothing else besides the DataNode was running, and once
> > >> > it went down, it affected literally everything, and restarting the
> > >> > RegionServers in the end fixed it.
> > >> >
> > >> > We are using HBase 0.98.0 with Hadoop 2.4.0.
> > >>
> > >> --
> > >> Best regards,
> > >>
> > >>    - Andy
> > >>
> > >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >> (via Tom White)