Subject: Re: Region Server failure due to remote data node errors
From: N Keywal
To: Jay T
Cc: user@hbase.apache.org
Date: Mon, 30 Jul 2012 23:06:45 +0200

Hi Jay,

As you said already, the pipeline for blk_5116092240243398556_489796 contains only
dead nodes, and this is likely the cause of the wrong behavior. This block is used
by an hlog file created just before the error. I don't get why there are 3 nodes in
the pipeline; I would expect only 2. Do you have a specific setting for
dfs.replication?

Log files are specific: HBase checks that the replication really occurs by checking
the replication count, and closes them if it's not ok. But it seems that all the
nodes were dead from the start, and this could be ill-managed in HBase. Reproducing
this may be difficult, but should be possible.

Then the region server is stopped, but I didn't see in the logs what the path for
this was, so it's surprising to say the least. After this, the errors on 'already
closed' are not that critical imho: the close will fail, as hdfs closes the file
when it cannot recover from an error.

I guess your question is still open.
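To make the replication check above concrete, here is a minimal sketch, in the
spirit of what HBase does but not its actual code, of how a writer on Hadoop 1.x can
ask its open hlog output stream how many live replicas the write pipeline currently
has, and decide to roll the log when that drops below the expected count. On Hadoop
1.x the underlying DFSOutputStream exposes getNumCurrentReplicas(), which is not
public API, so the sketch reaches it via reflection and degrades gracefully if the
method is missing; the class name, method names and the roll decision are made up
for illustration.

import java.lang.reflect.Method;

import org.apache.hadoop.fs.FSDataOutputStream;

// Illustrative sketch only: query the current replica count of an open HDFS
// output stream (e.g. the hlog) and flag when it falls below expectations.
public final class WalReplicationCheck {

  /** Returns the stream's current replica count, or -1 if it cannot be read. */
  public static int currentReplicas(FSDataOutputStream out) {
    try {
      // The wrapped stream is the DFS client output stream on Hadoop 1.x;
      // getNumCurrentReplicas() is not public API, so reach it via reflection.
      Object wrapped = out.getWrappedStream();
      Method m = wrapped.getClass().getDeclaredMethod("getNumCurrentReplicas");
      m.setAccessible(true);
      return ((Integer) m.invoke(wrapped)).intValue();
    } catch (Exception e) {
      return -1; // method absent on this Hadoop version, or the call failed
    }
  }

  /** True when the live pipeline is smaller than the replication we asked for. */
  public static boolean shouldRollLog(FSDataOutputStream out, int expectedReplication) {
    int current = currentReplicas(out);
    return current != -1 && current < expectedReplication;
  }
}

A log writer would call shouldRollLog(stream, expectedReplication) periodically, for
example after each sync, and open a new hlog file when it returns true, so that a
pipeline full of dead datanodes is abandoned instead of being written to.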
But from what I see it could be an HBase bug. I will be interested to know the
conclusions of your analysis...

Nicolas

On Mon, Jul 30, 2012 at 8:01 PM, Jay T wrote:
> Thanks for the quick reply Nicolas. We are using HBase 0.94 on Hadoop 1.0.3.
>
> I have uploaded the logs here:
>
> Region Server log: http://pastebin.com/QEQ22UnU
> Data Node log: http://pastebin.com/DF0JNL8K
>
> Appreciate your help in figuring this out.
>
> Thanks,
> Jay
>
>
> On 7/30/12 1:02 PM, N Keywal wrote:
>>
>> Hi Jay,
>>
>> Yes, the whole log would be interesting, plus the logs of the datanode
>> on the same box as the dead RS.
>> What are your hbase & hdfs versions?
>>
>> The RS should be immune to hdfs errors. There are known issues (see
>> HDFS-3701), but it seems you have something different...
>> This:
>>>
>>> java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949
>>> remote=/10.128.204.225:50010]
>>
>> seems to say that the error was between the RS and the datanode on the
>> same box as the RS?
>>
>> Nicolas
>>
>> On Mon, Jul 30, 2012 at 6:43 PM, Jay T wrote:
>>>
>>> A couple of our region servers (in a 16 node cluster) crashed due to
>>> underlying Data Node errors. I am trying to understand how errors on
>>> remote data nodes impact other region server processes.
>>>
>>> *To briefly describe what happened:*
>>> 1) The cluster was in operation. All 16 nodes were up; reads and writes
>>> were happening extensively.
>>> 2) Nodes 7 and 8 were shut down for maintenance. (No graceful shutdown:
>>> the DN and RS services were running and the power was just pulled out.)
>>> 3) Nodes 2 and 5 flushed and the DFS client started reporting errors.
>>> From the log it seems like DFS blocks were being replicated to the nodes
>>> that were shut down (7 and 8), and since replication could not go through
>>> successfully the DFS client raised errors on 2 and 5 and eventually the
>>> RS itself died.
>>>
>>> The question I am trying to get an answer for is: Is a Region Server
>>> immune from remote data node errors (that are part of the replication
>>> pipeline) or not?
>>>
>>> *Part of the Region Server Log:* (Node 5)
>>>
>>> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Exception in
>>> createBlockOutputStream 10.128.204.225:50010 java.io.IOException: Bad
>>> connect ack with firstBadLink as 10.128.204.228:50010
>>> 2012-07-26 18:53:15,245 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning
>>> block blk_-316956372096761177_489798
>>> 2012-07-26 18:53:15,246 INFO org.apache.hadoop.hdfs.DFSClient: Excluding
>>> datanode 10.128.204.228:50010
>>> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.StoreFile:
>>> NO General Bloom and NO DeleteFamily was added to HFile
>>> (hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124)
>>> 2012-07-26 18:53:16,903 INFO org.apache.hadoop.hbase.regionserver.Store:
>>> Flushed , sequenceid=4046717645, memsize=256.5m, into tmp file
>>> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
>>> 2012-07-26 18:53:16,907 DEBUG org.apache.hadoop.hbase.regionserver.Store:
>>> Renaming flushed file at
>>> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/.tmp/26f5cd1fb2cb4547972a31073d2da124
>>> to
>>> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124
>>> 2012-07-26 18:53:16,921 INFO org.apache.hadoop.hbase.regionserver.Store:
>>> Added
>>> hdfs://Node101:8020/hbase/table/754de060c9d96286e0c8cd200716ffde/CF/26f5cd1fb2cb4547972a31073d2da124,
>>> entries=1137956, sequenceid=4046717645, filesize=13.2m
>>> 2012-07-26 18:53:32,048 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
>>> Exception: java.net.SocketTimeoutException: 15000 millis timeout while
>>> waiting for channel to be ready for write. ch :
>>> java.nio.channels.SocketChannel[connected local=/10.128.204.225:52949
>>> remote=/10.128.204.225:50010]
>>> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>> at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>> at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>> at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>> at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2857)
>>> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_5116092240243398556_489796 bad datanode[0]
>>> 10.128.204.225:50010
>>> 2012-07-26 18:53:32,049 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_5116092240243398556_489796 in pipeline
>>> 10.128.204.225:50010, 10.128.204.221:50010, 10.128.204.227:50010: bad
>>> datanode 10.128.204.225:50010
>>>
>>> I can pastebin the entire log, but this is when things started going wrong
>>> for Node 5; eventually the shutdown hook for the RS started and the RS was
>>> shut down.
>>>
>>> Any help in troubleshooting this is greatly appreciated.
>>>
>>> Thanks,
>>> Jay
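On the dfs.replication question above, a small sketch of how one could check which
replication factor actually applies to the write-ahead logs. It assumes the client
has the cluster's Hadoop site configuration on its classpath and that the WALs live
under the default 0.94 location ${hbase.rootdir}/.logs, i.e. /hbase/.logs for the
hbase.rootdir visible in the log; both are assumptions to adjust for the real
cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: print the configured dfs.replication and the replication
// factor recorded for each WAL file under the assumed /hbase/.logs directory.
public final class CheckWalReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // assumes the site config is on the classpath
    FileSystem fs = FileSystem.get(conf);
    System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));

    FileStatus[] rsDirs = fs.listStatus(new Path("/hbase/.logs")); // assumed WAL root
    if (rsDirs == null) {
      System.out.println("No WAL directory found at /hbase/.logs");
      return;
    }
    for (FileStatus rsDir : rsDirs) {
      FileStatus[] logs = fs.listStatus(rsDir.getPath());
      if (logs == null) {
        continue;
      }
      for (FileStatus log : logs) {
        System.out.println(log.getPath() + " replication=" + log.getReplication());
      }
    }
  }
}

The pipeline in the log above has three datanodes, which matches the HDFS default of
dfs.replication=3; the output of this check would show whether the cluster overrides
that value.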