Subject: Datanodes shutdown and HBase's regionservers not working
From: Davey Yan <davey.yan@gmail.com>
To: user@hadoop.apache.org
Date: Mon, 25 Feb 2013 17:10:33 +0800

Hey guys,

We have a cluster with 5 nodes (1 NN and 4 DNs) that has been running for more than a year and normally works fine, but the datanodes got shut down twice in the last month.

When the datanodes went down, all of them became "Dead Nodes" in the NN web admin UI (http://ip:50070/dfshealth.jsp), while the HBase regionservers still showed up as live in the HBase web admin (http://ip:60010/master-status); of course they were zombies. All of the JVM processes were still running, including hmaster/namenode/regionserver/datanode.

At the same time the load on the slaves (as reported by "top") became very high, more than 10, much higher than during normal operation, and "top" showed that the datanode and regionserver processes were the ones consuming CPU. We could not stop HBase or Hadoop through the normal commands (stop-*.sh / *-daemon.sh stop *), so we stopped the datanodes and regionservers with kill -9 PID, after which the load on the slaves returned to a normal level and we started the cluster again.
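For reference, here is my rough understanding of when the NameNode declares a DataNode dead, based on my reading of the Hadoop 1.0.x FSNamesystem code; the property names and defaults in the sketch are my assumptions, so please correct me if they are wrong.

// Back-of-the-envelope sketch (not our production code) of the NameNode
// dead-node check as I understand it from the 1.0.x sources. The property
// names and default values in the comments are assumptions on my part.
public class HeartbeatExpiry {
    public static void main(String[] args) {
        long heartbeatIntervalMs = 3 * 1000L;      // dfs.heartbeat.interval, default 3 s
        long recheckIntervalMs = 5 * 60 * 1000L;   // heartbeat.recheck.interval, default 5 min

        // A DataNode is treated as dead once no heartbeat has been received
        // for 2 * recheck + 10 * interval, about 10.5 minutes with defaults.
        long expireMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
        System.out.println("dead-node threshold: " + (expireMs / 1000) + " s");
    }
}

If that is right, the DNs being removed at 11:10 means the NN had not seen a heartbeat from them since roughly 11:00, which lines up with the first exceptions in the DN logs below.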
NN log at the shutdown point (all of the DNs were removed):

2013-02-22 11:10:02,278 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.152:50010
2013-02-22 11:10:02,278 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.149:50010
2013-02-22 11:10:02,693 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.149:50010
2013-02-22 11:10:02,693 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.150:50010
2013-02-22 11:10:03,004 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.150:50010
2013-02-22 11:10:03,004 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.1.148:50010
2013-02-22 11:10:03,339 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.148:50010

The DN logs show many IOExceptions and SocketTimeoutExceptions:

2013-02-22 11:02:52,354 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.148:50010, storageID=DS-970284113-117.25.149.160-50010-1328074119937, infoPort=50075, ipcPort=50020):DataXceiver
java.io.IOException: Interrupted receiveBlock
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:398)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
        at java.lang.Thread.run(Thread.java:662)

2013-02-22 11:03:44,823 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.148:50010, storageID=DS-970284113-117.25.149.160-50010-1328074119937, infoPort=50075, ipcPort=50020):Got exception while serving blk_-1985405101514576650_247001 to /192.168.1.148:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.148:50010 remote=/192.168.1.148:48654]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
        at java.lang.Thread.run(Thread.java:662)

2013-02-22 11:09:42,294 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.148:50010, storageID=DS-970284113-117.25.149.160-50010-1328074119937, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.148:50010 remote=/192.168.1.148:37188]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:350)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:436)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:197)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
        at java.lang.Thread.run(Thread.java:662)

2013-02-22 11:12:41,892 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-2674357249542194287_43419
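The "480000 millis timeout" in those traces looks like the default DataNode socket write timeout (8 minutes; as far as I know it is controlled by dfs.datanode.socket.write.timeout). To see when the stalls actually start, something like the throwaway helper below can bucket the SocketTimeoutException lines in a DataNode log by minute; the fallback log path is only an example for our installation.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;

// Throwaway helper: counts SocketTimeoutException lines in a DataNode log
// per minute, to see when the node started stalling. The fallback log path
// is only an example; pass the real path as the first argument.
public class CountDnTimeouts {
    public static void main(String[] args) throws Exception {
        String logPath = args.length > 0 ? args[0]
                : "/var/log/hadoop/hadoop-datanode.log";   // example path
        Map<String, Integer> perMinute = new TreeMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(logPath));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Only count lines that start with a timestamp such as
                // "2013-02-22 11:09:42,294" and mention the timeout.
                if (line.length() >= 16
                        && Character.isDigit(line.charAt(0))
                        && line.contains("SocketTimeoutException")) {
                    String minute = line.substring(0, 16);   // "2013-02-22 11:09"
                    Integer n = perMinute.get(minute);
                    perMinute.put(minute, n == null ? 1 : n + 1);
                }
            }
        } finally {
            in.close();
        }
        for (Map.Entry<String, Integer> e : perMinute.entrySet()) {
            System.out.println(e.getKey() + "  " + e.getValue());
        }
    }
}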
Here is our env:

hadoop 1.0.3
hbase 0.94.1 (snappy enabled)
java version "1.6.0_31"
Java(TM) SE Runtime Environment (build 1.6.0_31-b04)
Java HotSpot(TM) 64-Bit Server VM (build 20.6-b01, mixed mode)

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 32768
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

# uname -a
Linux ubuntu6401 2.6.32-33-server #70-Ubuntu SMP Thu Jul 7 22:28:30 UTC 2011 x86_64 GNU/Linux

# free (master)
             total       used       free     shared    buffers     cached
Mem:      24732936    8383708   16349228          0     490584    2580356
-/+ buffers/cache:    5312768   19420168
Swap:     72458232          0   72458232

# free (slaves)
             total       used       free     shared    buffers     cached
Mem:      24733000   22824276    1908724          0     862556   15303304
-/+ buffers/cache:    6658416   18074584
Swap:     72458232        264   72457968

Some important conf:

core-site.xml:
  io.file.buffer.size = 65536

hdfs-site.xml:
  dfs.block.size = 134217728
  dfs.datanode.max.xcievers = 4096
  dfs.support.append = true
  dfs.replication = 2

Hope you can help us. Thanks in advance.

--
Davey Yan
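P.S. To double-check what values the DataNode JVMs actually pick up, something like the small program below can print them via org.apache.hadoop.conf.Configuration; the conf file path and the fallback defaults are only examples, not our real setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Prints a few of the settings listed above as the Hadoop Configuration
// class resolves them. The hdfs-site.xml path and the fallback defaults
// are examples only; adjust them to the actual installation.
public class PrintDnConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));  // example path
        System.out.println("dfs.datanode.max.xcievers = "
                + conf.getInt("dfs.datanode.max.xcievers", 256));
        System.out.println("dfs.block.size = "
                + conf.getLong("dfs.block.size", 64L * 1024 * 1024));
        System.out.println("dfs.replication = "
                + conf.getInt("dfs.replication", 3));
    }
}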