From: Dejan Menges
Date: Fri, 20 Mar 2015 12:36:03 +0000
Subject: Re: Strange issue when DataNode goes down
To: user@hbase.apache.org

Hi,

Sorry for the slightly late update, but I managed to narrow it down a bit.

We haven't upgraded yet, as we are using the Hortonworks distribution right
now, and even if we upgrade we would only get 0.98.4. However, it looks like
the issue was in our use case and configuration (still looking into it).

Initially I saw that whenever one server goes down we start having general
performance issues, but it turned out to be on our client side: because of
caching, clients kept trying to reconnect to nodes that were offline and then
tried to fetch regions from those nodes as well. That is basically why, on
the server side, I could not find anything in the logs that was even slightly
interesting or pointed me in the right direction.

Another question that came up: when a server goes down (and with it the
DataNode and the HRegionServer it was hosting), what is a sensible time for
the HMaster to consider the server dead and reassign its regions elsewhere?
This is another performance bottleneck we hit while those regions are
inaccessible. In our case it is configured to 15 minutes, and simple logic
says that if you want it to happen sooner you configure a lower number of
retries, but as always the devil is in the details, so I am not sure whether
anyone has better math for this.
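Just to make this concrete, below are the kinds of settings I have been
staring at on our side. The values are purely illustrative (not our
production config), and I may well be looking at the wrong knobs, so please
correct me if so:

    <!-- hbase-site.xml: illustrative values only -->

    <!-- As I understand it, the master only declares a RegionServer dead
         and starts reassigning its regions after its ZooKeeper session
         expires. -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>90000</value>
    </property>

    <!-- Client side: how many times, and with how long a pause, clients
         keep retrying a cached (possibly dead) region location. Fewer
         retries / a shorter pause means clients give up on a dead node
         sooner, at the price of more failures during short hiccups. -->
    <property>
      <name>hbase.client.retries.number</name>
      <value>10</value>
    </property>
    <property>
      <name>hbase.client.pause</name>
      <value>100</value>
    </property>

    <!-- How long a single RPC to a RegionServer may take before the client
         gives up on that attempt. -->
    <property>
      <name>hbase.rpc.timeout</name>
      <value>60000</value>
    </property>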
And one last question: is it possible to manually force HBase to reassign
regions? In this situation, while the HMaster is still retrying to contact
the dead node, it is impossible to force it using the 'balancer' command.
(A rough sketch of the shell commands I mean is at the bottom of this mail,
below the quoted thread.)

Thanks a lot!
Dejan

On Tue, Mar 17, 2015 at 9:37 AM Dejan Menges wrote:

> Hi,
>
> To be very honest, there is no particular reason why we stick to this
> version besides simply a lack of time to go through the upgrade process,
> but it looks like that is going to be the next step.
>
> Had a crazy day and didn't have time to go through all the logs again;
> on top of that, one of the machines (the last one where we had this
> issue) was fully reprovisioned yesterday, so I don't have logs from
> there anymore.
>
> Besides upgrading, which I will discuss today, can you just point me to
> the specific RPC issue in 0.98.0? The thing is that we see some strange
> RPC behaviour in this case, and I just want to check whether it is the
> same issue (we were already suspecting RPC).
>
> Thanks a lot!
> Dejan
>
> On Mon, Mar 16, 2015 at 9:32 PM, Andrew Purtell wrote:
>
>> Is there a particular reason why you are using HBase 0.98.0? The latest
>> 0.98 release is 0.98.11. There is a known performance issue with 0.98.0
>> pertaining to RPC that was fixed in later releases, so you should move
>> up from 0.98.0. In addition, hundreds of improvements and bug fixes have
>> gone into the ten releases since 0.98.0.
>>
>> On Mon, Mar 16, 2015 at 6:40 AM, Dejan Menges wrote:
>>
>> > Hi All,
>> >
>> > We have a strange issue with HBase performance (overall cluster
>> > performance) in case one of the DataNodes in the cluster unexpectedly
>> > goes down.
>> >
>> > The scenario is as follows:
>> > - The cluster works fine and is stable.
>> > - One DataNode unexpectedly goes down (PSU issue, network issue,
>> >   anything).
>> > - The whole HBase cluster goes down (performance becomes so bad that
>> >   we have to restart all RegionServers to bring it back to life).
>> >
>> > The funniest and most recent case was when we added a new node to the
>> > cluster (with 8 x 4T SATA disks) and left just the DataNode running on
>> > it, to give it a couple of days to accumulate some data. At some
>> > point, due to a hardware issue, the server rebooted (twice within
>> > three hours) at a moment when it had maybe 5% of the data it would
>> > eventually hold. Nothing besides the DataNode was running on it, and
>> > once it went down it affected literally everything; restarting the
>> > RegionServers in the end fixed it.
>> >
>> > We are using HBase 0.98.0 with Hadoop 2.4.0
>> >
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
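P.S. For the manual reassignment question above: this is the kind of thing I
had in mind from the HBase shell, assuming it works at all while the master
still thinks the old server might come back. Table, region and server names
below are made up just to show the shape of the commands:

    # Move a single region to a chosen RegionServer; 'move' takes the
    # *encoded* region name plus the target server as 'host,port,startcode'
    # (all values here are invented):
    move 'd0e89a1f5bb4c8e7a2f3b6c9d1e0f4a7', 'rs-node-07.example.com,60020,1426800000000'

    # Or unassign/assign a region so the master re-places it (full region
    # name, again invented):
    unassign 'usertable,user5000,1426800000000.d0e89a1f5bb4c8e7a2f3b6c9d1e0f4a7.', true
    assign 'usertable,user5000,1426800000000.d0e89a1f5bb4c8e7a2f3b6c9d1e0f4a7.'

    # 'balancer' is what we tried first; as far as I can tell it simply
    # refuses to run (returns false) while regions are in transition:
    balancer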