From: Ian Brooks
To: user@hbase.apache.org
Subject: Re: regionserver crash after node decommission
Date: Thu, 05 Jun 2014 12:33:30 +0100

Hi,

The logs around that time for .112 are:

2014-06-04 13:55:08,797 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.143.38.112:50010, dest: /10.143.38.105:34986, bytes: 131072, op: HDFS_READ, cliID: DFSClient_hb_rs_sw-hadoop-004,16020,1401718647248_343372106_28, offset: 0, srvID: 112a7edd-e879-4497-bc3b-0e4d5b88f45b, blockid: BP-2121456822-10.143.38.149-1396953188241:blk_1073994110_253327, duration: 84803151
2014-06-04 13:55:53,254 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.143.38.112:50010, dest: /10.143.38.112:32763, bytes: 5794, op: HDFS_READ, cliID: DFSClient_hb_rs_sw-hadoop-005,16020,1401885988003_-537176952_28, offset: 0, srvID: 112a7edd-e879-4497-bc3b-0e4d5b88f45b, blockid: BP-2121456822-10.143.38.149-1396953188241:blk_1074085666_344919, duration: 372334
2014-06-04 13:56:02,891 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.143.38.112:50010, dest: /10.143.38.112:32764, bytes: 5794, op: HDFS_READ, cliID: DFSClient_hb_rs_sw-hadoop-005,16020,1401885988003_-537176952_28, offset: 0, srvID: 112a7edd-e879-4497-bc3b-0e4d5b88f45b, blockid: BP-2121456822-10.143.38.149-1396953188241:blk_1074085666_344919, duration: 429859
2014-06-04 13:57:09,690 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-2121456822-10.143.38.149-1396953188241:blk_1074085687_344940 src: /10.143.38.112:32770 dest: /10.143.38.112:50010
2014-06-04 14:00:11,213 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-2121456822-10.143.38.149-1396953188241:blk_1074085689_344942 src: /10.143.38.104:45623 dest: /10.143.38.112:50010
2014-06-04 14:00:15,438 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1223ms
No GCs detected
2014-06-04 14:00:22,136 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-2121456822-10.143.38.149-1396953188241:blk_1074085691_344944 src: /10.143.38.149:30394 dest: /10.143.38.112:50010
2014-06-04 14:00:22,152 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/home/hadoop/hdfs/datanode/current]'}, localName='sw-hadoop-005:50010', datanodeUuid='112a7edd-e879-4497-bc3b-0e4d5b88f45b', xmitsInProgress=0}:Exception transfering block BP-2121456822-10.143.38.149-1396953188241:blk_1074085691_344944 to mirror 10.143.38.105:50010: java.net.ConnectException: Connection refused
2014-06-04 14:00:22,152 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-2121456822-10.143.38.149-1396953188241:blk_1074085691_344944 received exception java.net.ConnectException: Connection refused
2014-06-04 14:00:22,152 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: sw-hadoop-005:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.143.38.149:30394 dest: /10.143.38.112:50010
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:484)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
        at java.lang.Thread.run(Thread.java:744)
2014-06-04 14:00:22,365 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-2121456822-10.143.38.149-1396953188241:blk_1074085694_344947 src: /10.143.38.115:20551 dest: /10.143.38.112:50010
2014-06-04 14:00:22,368 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-2121456822-10.143.38.149-1396953188241:blk_1074085697_344950 src: /10.143.38.104:45654 dest: /10.143.38.112:50010

The failure at 14:00 could be related, though it is 3 seconds later and all hosts are time synchronised. There is nothing in the logs on the hosts to suggest the clock was adjusted by any amount.

-Ian Brooks
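On the timing question: as I understand it, the JvmPauseMonitor line above comes from a watchdog thread that sleeps for a fixed interval and measures how much extra time actually elapsed, and "No GCs detected" means the GC counters did not move across the gap, which points at the host (scheduler stalls, swap, a stopped guest) rather than the JVM. A minimal sketch of the technique, with illustrative names and thresholds rather than the actual org.apache.hadoop.util.JvmPauseMonitor code:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch of the JvmPauseMonitor idea: sleep a fixed interval on a
// monotonic clock and see how much *extra* time actually passed. If the GC
// counters did not change across the gap, the stall likely came from the
// host (scheduler, swap, a paused VM) rather than the JVM. Names and
// thresholds are illustrative, not the real Hadoop implementation.
public class PauseMonitorSketch {
    private static final long SLEEP_MS = 500;  // polling interval
    private static final long WARN_MS = 1000;  // report pauses above this

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            long gcCountBefore = totalGcCount();
            long start = System.nanoTime();    // monotonic, unlike wall time
            Thread.sleep(SLEEP_MS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            long extraMs = elapsedMs - SLEEP_MS;
            if (extraMs > WARN_MS) {
                boolean gcRan = totalGcCount() > gcCountBefore;
                System.out.printf("Detected pause of approximately %dms%s%n",
                        extraMs,
                        gcRan ? " (GC activity seen)" : "; no GCs detected");
            }
        }
    }

    private static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            total += gc.getCollectionCount(); // may be -1 if unsupported
        }
        return total;
    }
}

Since the measurement uses System.nanoTime, which is monotonic, a step in the wall clock alone should not register as a pause; the 1223ms gap therefore looks like a genuine stall on the host rather than a clock change, which fits with nothing in the logs suggesting the clock was adjusted.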
On Wed, Jun 4, 2014 at 7:08 AM, Ian Brooks wrote:
> Hi
>
> Well I performed the procedure on another 4 nodes today and no node fell
> over from it, so perhaps I was just very unlucky with the 2 previous
> attempts.

There is an issue in DFSClient. Need to take a look...

> When I shut down the datanodes I saw errors in the logs, but they are to
> be expected and the servers just continued as normal after that. The only
> slight oddity was the datanode in question for the log below was
> 10.143.38.105, so I'm not sure why it was complaining about 10.143.38.112
> and 10.143.38.116 as they were both running and healthy.

Anything in logs of 10.143.38.112 around this time? The ERROR message is composed by the catch in DataXceiver#run. It has 10.143.38.116 as the localAddress and .112 as remote (it closed the connection, hence the 'Connection reset by peer'?).

St.Ack
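To make the src/dest orientation concrete, here is a minimal sketch of the kind of catch block being described: the logging datanode formats the ERROR line from the two endpoints of the connection it was servicing, so the peer that went away shows up as src and the logging node as dest. This is an illustration of the log format only, not the real org.apache.hadoop.hdfs.server.datanode.DataXceiver source:

import java.io.IOException;
import java.net.InetSocketAddress;

// Simplified sketch of a DataXceiver#run-style error path: the catch block
// builds the ERROR message from the connection's two endpoints, so "src" is
// the remote peer and "dest" is the local address of the datanode doing the
// logging. Names and addresses below are taken from the quoted logs.
class DataXceiverSketch implements Runnable {
    private final InetSocketAddress remote; // peer that sent the op (src)
    private final InetSocketAddress local;  // this datanode's side (dest)
    private final String datanodeName;      // e.g. "sw-hadoop-007:50010"

    DataXceiverSketch(InetSocketAddress remote, InetSocketAddress local,
                      String datanodeName) {
        this.remote = remote;
        this.local = local;
        this.datanodeName = datanodeName;
    }

    @Override
    public void run() {
        String op = "WRITE_BLOCK"; // in reality read from the wire
        try {
            processOp(op);
        } catch (Throwable t) {
            // Same shape as the ERROR in the logs below: .116 logs itself
            // as dest and reports .112, which dropped the socket, as src.
            System.err.printf(
                "%s:DataXceiver error processing %s operation src: /%s:%d dest: /%s:%d%n",
                datanodeName, op,
                remote.getAddress().getHostAddress(), remote.getPort(),
                local.getAddress().getHostAddress(), local.getPort());
        }
    }

    private void processOp(String op) throws IOException {
        // Reading the next packet fails like this when the remote end has
        // already closed the connection.
        throw new IOException("Connection reset by peer");
    }

    public static void main(String[] args) {
        new DataXceiverSketch(
                new InetSocketAddress("10.143.38.112", 24190),
                new InetSocketAddress("10.143.38.116", 50010),
                "sw-hadoop-007:50010").run();
    }
}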
> 2014-06-04 13:57:07,641 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-2121456822-10.143.38.149-1396953188241:blk_1074085660_344913, type=HAS_DOWNSTREAM_IN_PIPELINE
> java.io.EOFException: Premature EOF: no length prefix available
>         at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1883)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:958)
>         at java.lang.Thread.run(Thread.java:744)
> 2014-06-04 13:57:07,646 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-2121456822-10.143.38.149-1396953188241:blk_1074085660_344913
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:442)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:701)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:572)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
>         at java.lang.Thread.run(Thread.java:744)
> 2014-06-04 13:57:07,647 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
> java.nio.channels.ClosedByInterruptException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1087)
>         at java.lang.Thread.run(Thread.java:744)
> 2014-06-04 13:57:07,648 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
> java.nio.channels.ClosedByInterruptException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1087)
>         at java.lang.Thread.run(Thread.java:744)
> 2014-06-04 13:57:07,648 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Not checking disk as checkDiskError was called on a network related exception
> 2014-06-04 13:57:07,648 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-2121456822-10.143.38.149-1396953188241:blk_1074085660_344913, type=HAS_DOWNSTREAM_IN_PIPELINE
> java.nio.channels.ClosedByInterruptException
>         at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:496)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1087)
>         at java.lang.Thread.run(Thread.java:744)
> 2014-06-04 13:57:07,648 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-2121456822-10.143.38.149-1396953188241:blk_1074085660_344913, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 2014-06-04 13:57:07,648 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-2121456822-10.143.38.149-1396953188241:blk_1074085660_344913 received exception java.io.IOException: Connection reset by peer
> 2014-06-04 13:57:07,649 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: sw-hadoop-007:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.143.38.112:24190 dest: /10.143.38.116:50010
> java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
>         at java.io.DataInputStream.read(DataInputStream.java:149)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:442)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:701)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:572)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:115)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
>         at java.lang.Thread.run(Thread.java:744)
>
> -Ian Brooks
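One detail from the quoted logs worth highlighting: the "Not checking disk as checkDiskError was called on a network related exception" line shows the datanode classifying the failure before deciding whether to scan its volumes, so these ClosedByInterruptException and connection-reset events do not trigger a disk check. A rough sketch of that classification follows; the exact predicate in the real DataNode#checkDiskError may differ, so treat the list below as illustrative:

import java.io.IOException;
import java.net.SocketException;
import java.net.SocketTimeoutException;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;

// Sketch of the idea behind the "Not checking disk..." log line: before
// kicking off an expensive volume scan, first rule out exceptions that
// clearly came from the socket rather than the disks. The predicate here
// is an assumption, not the actual Hadoop source.
final class DiskErrorCheckSketch {

    static boolean isNetworkRelated(Exception e) {
        // ClosedByInterruptException (seen in the logs above) is a subclass
        // of ClosedChannelException, so it is covered by that check.
        return e instanceof SocketException
                || e instanceof SocketTimeoutException
                || e instanceof ClosedChannelException
                || (e instanceof IOException && e.getMessage() != null
                    && e.getMessage().contains("Connection reset by peer"));
    }

    static void checkDiskError(Exception e) {
        if (isNetworkRelated(e)) {
            System.out.println("Not checking disk as checkDiskError was"
                    + " called on a network related exception");
            return;
        }
        // ...otherwise scan the configured data directories for failed
        // volumes and report them to the namenode...
    }

    public static void main(String[] args) {
        // Reproduces the decision taken in the quoted logs.
        checkDiskError(new ClosedByInterruptException());
    }
}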