From: Ian Brooks
To: user@hadoop.apache.org
Subject: Re: 100% CPU utilization on idle HDFS Data Nodes
Date: Tue, 03 Jun 2014 17:44:44 +0100
Message-ID: <2546422.6xkLWlx4Ft@localhost.localdomain>

Hi Shayan,

If you restart one of the datanodes, does that node go back to normal CPU usage? If so, that looks like the same issue I'm seeing on my nodes, though mine will go to 200% over time on a 4-CPU host. I haven't been able to track the cause down yet. Heavy use of HDFS will cause the node to jump to 100% sooner, and it stays there even when doing very little.

Ted, logs from one of my nodes to go with Shayan's:

2014-06-03 17:01:28,755 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-2121456822-10.143.38.149-1396953188241:blk_1074075941_335194, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2014-06-03 17:06:41,860 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Scheduling blk_1074074152_333405 file /home/hadoop/hdfs/datanode/current/BP-2121456822-10.143.38.149-1396953188241/current/finalized/subdir12/blk_1074074152 for deletion
2014-06-03 17:06:41,871 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Deleted BP-2121456822-10.143.38.149-1396953188241 blk_1074074152_333405 file /home/hadoop/hdfs/datanode/current/BP-2121456822-10.143.38.149-1396953188241/current/finalized/subdir12/blk_1074074152
2014-06-03 17:08:32,843 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Got a command from standby NN - ignoring command:2
2014-06-03 17:13:44,320 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-2121456822-10.143.38.149-1396953188241:blk_1074075943_335196 src: /10.143.38.100:26618 dest: /10.143.38.100:50010
2014-06-03 17:13:44,351 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.143.38.100:25817, dest: /10.143.38.100:50010, bytes: 863, op: HDFS_WRITE, cliID: DFSClient_hb_rs_sw-hmaster-002,16020,1401717607010_805952715_28, offset: 0, srvID: 62f7174f-a44d-4cc1-b62c-095782a86164, blockid: BP-2121456822-10.143.38.149-1396953188241:blk_1074075489_334742, duration: 3600148963947
2014-06-03 17:13:44,352 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-2121456822-10.143.38.149-1396953188241:blk_1074075489_334742, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2014-06-03 17:13:54,029 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.143.38.104:21368, dest: /10.143.38.100:50010, bytes: 227791, op: HDFS_WRITE, cliID: DFSClient_hb_rs_sw-hadoop-004,16020,1401718647248_-466626109_28, offset: 0, srvID: 62f7174f-a44d-4cc1-b62c-095782a86164, blockid: BP-2121456822-10.143.38.149-1396953188241:blk_1074075491_334744, duration: 3600133166499
2014-06-03 17:13:54,029 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-2121456822-10.143.38.149-1396953188241:blk_1074075491_334744, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2014-06-03 17:36:17,405 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-2121456822-10.143.38.149-1396953188241:blk_1074075950_335203 src: /10.143.38.149:48959 dest: /10.143.38.100:50010

top shows:

top - 17:41:55 up 18 days, 23:59,  2 users,  load average: 1.06, 1.04, 0.93
Tasks: 139 total,   1 running, 137 sleeping,   1 stopped,   0 zombie
Cpu(s): 16.4%us, 25.9%sy,  0.0%ni, 56.2%id,  0.0%wa,  0.0%hi,  0.0%si,  1.4%st
Mem:   8059432k total,  5870572k used,  2188860k free,   181076k buffers
Swap:   835576k total,        0k used,   835576k free,  2493828k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25450 hadoop    20   0 1728m 332m  17m S 95.8  4.2  49:24.47 java
25147 hbase     20   0 4998m 284m  15m S 13.4  3.6   7:28.30 java
 7212 flume     20   0 4558m 399m  20m S  1.9  5.1   0:25.61 java

jstack fails. I'm probably missing something, just not sure what:

sudo jstack -J-d64 -m 25450
Attaching to process ID 25450, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.51-b03
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at sun.tools.jstack.JStack.runJStackTool(JStack.java:136)
        at sun.tools.jstack.JStack.main(JStack.java:102)
Caused by: java.lang.RuntimeException: Unable to deduce type of thread from address 0x00007f0104010800 (expected type JavaThread, CompilerThread, ServiceThread, JvmtiAgentThread, or SurrogateLockerThread)
        at sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:162)
        at sun.jvm.hotspot.runtime.Threads.first(Threads.java:150)
        at sun.jvm.hotspot.tools.PStack.initJFrameCache(PStack.java:216)
        at sun.jvm.hotspot.tools.PStack.run(PStack.java:67)
        at sun.jvm.hotspot.tools.PStack.run(PStack.java:54)
        at sun.jvm.hotspot.tools.PStack.run(PStack.java:49)
        at sun.jvm.hotspot.tools.JStack.run(JStack.java:60)
        at sun.jvm.hotspot.tools.Tool.start(Tool.java:221)
        at sun.jvm.hotspot.tools.JStack.main(JStack.java:86)
        ... 6 more
Caused by: sun.jvm.hotspot.types.WrongTypeException: No suitable match for type of address 0x00007f0104010800
        at sun.jvm.hotspot.runtime.InstanceConstructor.newWrongTypeException(InstanceConstructor.java:62)
        at sun.jvm.hotspot.runtime.VirtualConstructor.instantiateWrapperFor(VirtualConstructor.java:80)
        at sun.jvm.hotspot.runtime.Threads.createJavaThreadWrapper(Threads.java:158)
        ... 14 more

-Ian Brooks

On Tuesday 03 Jun 2014 12:34:36 Shayan Pooya wrote:
> I have a three node HDFS cluster with a name-node. There is absolutely no
> IO going on this cluster or any jobs running on it and I just use it for
> testing the Disco HDFS integration.
> I noticed that two of the three data-nodes are using 100% CPU. They have
> been running for a long time (2 months) but with no jobs running on them
> and almost no usage:
>
> $ hadoop version
> Hadoop 2.3.0
> Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1567123
> Compiled by jenkins on 2014-02-11T13:40Z
> Compiled with protoc 2.5.0
> From source with checksum dfe46336fbc6a044bc124392ec06b85
>
> Is this a known bug?

--
-Ian Brooks
Senior server administrator - Sensewhere