Subject: Re: region server dead and datanode block movement error
From: Rohit Kelkar
To: "user@hbase.apache.org"
Date: Thu, 27 Feb 2014 11:05:39 -0600

Oh yes and forgot to add the ZK process
ZK = 5GB
Total = 45GB


On Thu, Feb 27, 2014 at 11:01 AM, Rohit Kelkar wrote:

> Hi Jean-Marc,
>
> Each node has 48GB RAM
> To isolate and debug the RS failure issue, we have switched off all other
> tools. The only processes running are
> - DN = 4GB
> - RS = 6GB
> - TT = 4GB
> - num mappers available on the node = 4 * 4GB = 16GB
> - num reducers available on the node = 2 * 4GB = 8GB
> - 4 other java processes unrelated to hadoop/hbase = 512MB * 4 = 2GB
>
> Total = 40GB
>
>
> On Thu, Feb 27, 2014 at 10:42 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
>
>> 2014-02-21 13:36:27,496 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":41236,"call":"next(-8680499896692404689, 1), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.0.0.96:46618","starttimems":1393007746259,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
>> 2014-02-21 13:41:00,272 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>
>> Your issue is clearly this.
>>
>> For the swap: setting swappiness does not guarantee that Linux will not swap.
>> It will try not to swap, but if it really has to, it will.
>>
>> How many GB on your server? How many for the DN, for the RS, etc.? Any TT on
>> them? Any other tool? If TT, how many slots? How many GB per slot?
>>
>> JM
>>
>>
>> 2014-02-27 11:37 GMT-05:00 Rohit Kelkar:
>>
>> > Hi Jean-Marc,
>> >
>> > I have updated the RS log here (http://pastebin.com/bVDvMvrB) with events
>> > before 13:41:00. In the log I see a few responseTooSlow warnings at
>> > 13:34:00, 13:36:00. Then no activity till 13:41:00.
>> > At 13:41:00 there is a Sleeper warning - WARN
>> > org.apache.hadoop.hbase.util.Sleeper: We slept 10193644ms instead of
>> > 10000000ms, this is likely due to a long garbage collecting pause and it's
>> > usually bad, see ...
>> > Followed by - INFO org.apache.zookeeper.ClientCnxn: Client session timed
>> > out, have not heard from server in 260409ms for sessionid
>> > 0x34432befe5417d2, closing socket connection and attempting reconnect.
>> >
>> > Looking at some of the reasons you mentioned -
>> > 1. I analyzed the GC logs for this RS. In the last 10 mins before the RS
>> > went down, the GC times are less than 1 sec. Nothing that will take 260409
>> > ms as indicated above in the RS log.
>> > 2. The RS node has swappiness set to 0
>> > 3. So I think I should investigate the possibility of network issues. Any
>> > pointers where I could start?
>> >
>> > - R
>> >
>> > On Thu, Feb 27, 2014 at 10:17 AM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
>> >
>> > > Hi Rohit,
>> > >
>> > > Usually YouAreDeadException is when your RegionServer is too slow. It gets
>> > > kicked out by Master+ZK but then tries to join back and gets informed it has
>> > > been kicked out.
>> > >
>> > > Reasons:
>> > > - Long Garbage Collection;
>> > > - Swapping;
>> > > - Network issues (get disconnected, then re-connected);
>> > > - etc.
>> > >
>> > > What do you have before 2014-02-21 13:41:00,308 in the logs?
>> > >
>> > >
>> > > 2014-02-27 11:13 GMT-05:00 Rohit Kelkar:
>> > >
>> > > > Hi, has anybody been facing similar issues?
>> > > >
>> > > > - R
>> > > >
>> > > >
>> > > > On Wed, Feb 26, 2014 at 12:55 PM, Rohit Kelkar <rohitkelkar@gmail.com> wrote:
>> > > >
>> > > > > We are running hbase 0.94.2 on hadoop 0.20 append version in production
>> > > > > (yes we have plans to upgrade hadoop). It's a 5 node cluster and a 6th node
>> > > > > running just the name node and hmaster.
>> > > > > I am seeing frequent RS YouAreDeadExceptions. Logs here
>> > > > > http://pastebin.com/44aFyYZV
>> > > > > The RS log shows a DFSOutputStream ResponseProcessor exception for block
>> > > > > blk_-6695300470410774365_837638 java.io.EOFException at 13:41:00 followed
>> > > > > by YouAreDeadException at the same time.
>> > > > > I grep'ed this block in the Datanode (see log here
>> > > > > http://pastebin.com/2jfwCfcK). At 13:41:00 I see an Exception in
>> > > > > receiveBlock for block blk_-6695300470410774365_837638
>> > > > > java.nio.channels.ClosedByInterruptException.
>> > > > > I have also attached the namenode logs around the block here
>> > > > > http://pastebin.com/9NE9J8s1
>> > > > >
>> > > > > Across several RS failure instances I see the following pattern - the
>> > > > > region server YouAreDeadException is always preceded by the EOFException
>> > > > > and datanode ClosedByInterruptException
>> > > > >
>> > > > > Is the error in the movement of the block causing the region server to
>> > > > > report a YouAreDeadException? And of course, how do I solve this?
>> > > > >
>> > > > > - R
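
For anyone following the thread, the two sets of numbers quoted in it (the per-node memory budget and the Sleeper warning) can be checked with a few lines of Python. This is only a rough sketch, not anything taken from HBase itself: the names NODE_RAM_GB, PROCESS_GB, committed_gb and sleeper_overshoot_ms are made up for illustration, the memory figures are copied from the messages as given, and the millisecond values in the log line are taken at face value.

```python
import re

# Per-node figures in GB, copied from the messages in this thread.
NODE_RAM_GB = 48
PROCESS_GB = {
    "DN": 4,
    "RS": 6,
    "TT": 4,
    "mappers (4 x 4GB)": 4 * 4,
    "reducers (2 x 4GB)": 2 * 4,
    "other java (4 x 512MB)": 4 * 0.5,
    "ZK": 5,
}

# Matches the Sleeper warning text quoted in the RS log.
SLEEPER_RE = re.compile(r"We slept (\d+)ms instead of (\d+)ms")


def committed_gb(processes):
    """Sum the memory assigned to all listed processes, in GB."""
    return sum(processes.values())


def sleeper_overshoot_ms(log_line):
    """Return (actual sleep - intended sleep) in ms, or None if no match."""
    match = SLEEPER_RE.search(log_line)
    if match is None:
        return None
    slept, expected = int(match.group(1)), int(match.group(2))
    return slept - expected


if __name__ == "__main__":
    total = committed_gb(PROCESS_GB)
    print(f"committed: {total:g} GB of {NODE_RAM_GB} GB, "
          f"headroom: {NODE_RAM_GB - total:g} GB")
    line = ("WARN org.apache.hadoop.hbase.util.Sleeper: "
            "We slept 10193644ms instead of 10000000ms")
    print(f"sleeper overshoot: {sleeper_overshoot_ms(line)} ms")
```

With the figures from this thread it reports 45 GB assigned out of 48 GB (3 GB left for everything else on the node) and a Sleeper overshoot of 193644 ms.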