hadoop-mapreduce-user mailing list archives

From Kai Ju Liu <ka...@tellapart.com>
Subject Re: High load, low CPU on hard-to-reach instances
Date Tue, 02 Aug 2011 04:41:14 GMT
Since migrating HDFS off of EBS-mounted drives and onto ephemeral drives,
this issue has not resurfaced. If anyone else experiences these issues in
the AWS stack, it's definitely worth considering migrating onto physical
disks.

Kai Ju
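
Note for anyone debugging the same symptom: on Linux, tasks in uninterruptible sleep (state D, usually waiting on disk I/O) count toward the load average, which is how a node can show a load near 38 on 8 CPUs while the CPUs sit mostly idle. The following is a minimal Python sketch, not something from the original thread, that compares the load average with the CPU count and lists tasks currently in state D. It assumes a Linux /proc filesystem; the function names are made up for illustration.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def task_state(stat_text):
    """Return the one-letter task state from /proc/<pid>/stat text.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def blocked_tasks(proc="/proc"):
    """List (pid, command) for tasks in uninterruptible sleep (state D)."""
    blocked = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                stat_text = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if task_state(stat_text) == "D":
            comm = stat_text[stat_text.index("(") + 1:stat_text.rindex(")")]
            blocked.append((int(entry), comm))
    return blocked


def main():
    if not os.path.exists("/proc/loadavg"):
        print("no /proc/loadavg here; this sketch is Linux-only")
        return
    with open("/proc/loadavg") as f:
        one, _, _ = parse_loadavg(f.read())
    cpus = os.cpu_count() or 1
    print(f"1-min load {one:.2f} on {cpus} CPUs ({one / cpus:.2f} per CPU)")
    for pid, comm in blocked_tasks():
        print(f"blocked in D state: pid={pid} comm={comm}")


if __name__ == "__main__":
    main()
```

If the per-CPU load is well above 1 and the list is dominated by filesystem or flush threads, that points toward storage stalls rather than CPU contention. It is only a point-in-time snapshot, so it is worth running a few times during an episode.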

On Wed, Jul 6, 2011 at 10:34 AM, Kai Ju Liu <kaiju@tellapart.com> wrote:

> I'll have to check in detail today when the issue (hopefully) resurfaces. I
> believe that in certain cases, the amount of free memory is low, but the
> memory is split roughly equally between "used" and "cached". Is that
> consistent with what you've seen in the past as well? Thanks!
>
> Kai Ju
>
> On Tue, Jul 5, 2011 at 9:01 PM, Matei Zaharia <matei@eecs.berkeley.edu> wrote:
>
>> What does the memory load look like on them? The one time I've seen stuff
>> like this happen regularly is with too much memory in use.
>>
>> Matei
>>
>> On Jul 5, 2011, at 9:36 PM, Kai Ju Liu wrote:
>>
>> Over the past week or two, I've been seeing an issue where hard-to-reach
>> (i.e. hard to ssh to) instances exhibit high load but low CPU. These
>> instances are hosted in EC2, of type c1.xlarge with 4 attached EBS volumes,
>> and run Ubuntu 10.04.1 with the 2.6.32-309-ec2 kernel.
>>
>> All of these instances serve as datanodes within Hadoop (CDH3u0
>> distribution). The first indication that something is wrong is that the
>> datanodes stop emitting heartbeats to the namenode. After around 10-15
>> minutes, the load on these instances recovers and the datanodes resume
>> emitting heartbeats. Below is an example of the state of an instance once I
>> finally logged onto it.
>>
>> Has anyone seen similar behavior either while running Hadoop or generally
>> across the board? I'm trying to narrow down whether these issues are
>> EBS-related, kernel-related, Hadoop-related, or something else entirely.
>> Thanks!
>>
>> Kai Ju
>>
>> ------------------------------
>>
>> hadoop@ip-10-66-135-31:~$ uptime
>> 00:26:19 up 3 days, 17:08, 2 users, load average: 38.54, 32.31, 19.21
>>
>> hadoop@ip-10-66-135-31:~$ ps -ef
>> UID PID PPID C STIME TTY TIME CMD
>> root 1 0 0 Jul02 ? 00:00:00 /sbin/init
>> root 2 0 0 Jul02 ? 00:00:00 kthreadd
>> root 3 2 0 Jul02 ? 00:00:00 migration/0
>> root 4 2 0 Jul02 ? 00:00:01 ksoftirqd/0
>> root 5 2 0 Jul02 ? 00:00:00 watchdog/0
>> root 6 2 0 Jul02 ? 00:00:08 events/0
>> root 7 2 0 Jul02 ? 00:00:00 cpuset
>> root 8 2 0 Jul02 ? 00:00:00 khelper
>> root 9 2 0 Jul02 ? 00:00:00 netns
>> root 10 2 0 Jul02 ? 00:00:00 async/mgr
>> root 11 2 0 Jul02 ? 00:00:00 xenwatch
>> root 12 2 0 Jul02 ? 00:00:00 xenbus
>> root 14 2 0 Jul02 ? 00:00:00 migration/1
>> root 15 2 0 Jul02 ? 00:00:00 ksoftirqd/1
>> root 16 2 0 Jul02 ? 00:00:00 watchdog/1
>> root 17 2 0 Jul02 ? 00:00:05 events/1
>> root 18 2 0 Jul02 ? 00:00:00 migration/2
>> root 19 2 0 Jul02 ? 00:00:00 ksoftirqd/2
>> root 20 2 0 Jul02 ? 00:00:00 watchdog/2
>> root 21 2 0 Jul02 ? 00:00:05 events/2
>> root 22 2 0 Jul02 ? 00:00:00 migration/3
>> root 23 2 0 Jul02 ? 00:00:00 ksoftirqd/3
>> root 24 2 0 Jul02 ? 00:00:00 watchdog/3
>> root 25 2 0 Jul02 ? 00:00:05 events/3
>> root 26 2 0 Jul02 ? 00:00:00 migration/4
>> root 27 2 0 Jul02 ? 00:00:00 ksoftirqd/4
>> root 28 2 0 Jul02 ? 00:00:00 watchdog/4
>> root 29 2 0 Jul02 ? 00:00:05 events/4
>> root 30 2 0 Jul02 ? 00:00:00 migration/5
>> root 31 2 0 Jul02 ? 00:00:00 ksoftirqd/5
>> root 32 2 0 Jul02 ? 00:00:00 watchdog/5
>> root 33 2 99 Jul02 ? 1184011132-12:04:39 events/5
>> root 34 2 0 Jul02 ? 00:00:00 migration/6
>> root 35 2 0 Jul02 ? 00:00:00 ksoftirqd/6
>> root 36 2 0 Jul02 ? 00:00:00 watchdog/6
>> root 37 2 99 Jul02 ? 1184011132-12:04:39 events/6
>> root 38 2 0 Jul02 ? 00:00:00 migration/7
>> root 39 2 0 Jul02 ? 00:00:00 ksoftirqd/7
>> root 40 2 0 Jul02 ? 00:00:00 watchdog/7
>> root 41 2 99 Jul02 ? 1184011132-12:04:39 events/7
>> root 42 2 0 Jul02 ? 00:00:00 sync_supers
>> root 43 2 99 Jul02 ? 1184011132-12:04:39 bdi-default
>> root 44 2 0 Jul02 ? 00:00:00 kintegrityd/0
>> root 45 2 0 Jul02 ? 00:00:00 kintegrityd/1
>> root 46 2 0 Jul02 ? 00:00:00 kintegrityd/2
>> root 47 2 0 Jul02 ? 00:00:00 kintegrityd/3
>> root 48 2 0 Jul02 ? 00:00:00 kintegrityd/4
>> root 49 2 0 Jul02 ? 00:00:00 kintegrityd/5
>> root 50 2 0 Jul02 ? 00:00:00 kintegrityd/6
>> root 51 2 0 Jul02 ? 00:00:00 kintegrityd/7
>> root 52 2 0 Jul02 ? 00:00:00 kblockd/0
>> root 53 2 0 Jul02 ? 00:00:00 kblockd/1
>> root 54 2 0 Jul02 ? 00:00:00 kblockd/2
>> root 55 2 0 Jul02 ? 00:00:00 kblockd/3
>> root 56 2 0 Jul02 ? 00:00:00 kblockd/4
>> root 57 2 0 Jul02 ? 00:00:00 kblockd/5
>> root 58 2 0 Jul02 ? 00:00:00 kblockd/6
>> root 59 2 0 Jul02 ? 00:00:00 kblockd/7
>> root 60 2 0 Jul02 ? 00:00:00 kseriod
>> root 70 2 0 Jul02 ? 00:00:00 khungtaskd
>> root 71 2 0 Jul02 ? 00:02:53 kswapd0
>> root 72 2 0 Jul02 ? 00:00:00 aio/0
>> root 73 2 0 Jul02 ? 00:00:00 aio/1
>> root 74 2 0 Jul02 ? 00:00:00 aio/2
>> root 75 2 0 Jul02 ? 00:00:00 aio/3
>> root 76 2 0 Jul02 ? 00:00:00 aio/4
>> root 77 2 0 Jul02 ? 00:00:00 aio/5
>> root 78 2 0 Jul02 ? 00:00:00 aio/6
>> root 79 2 0 Jul02 ? 00:00:00 aio/7
>> root 80 2 0 Jul02 ? 00:00:00 jfsIO
>> root 81 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 82 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 83 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 84 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 85 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 86 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 87 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 88 2 0 Jul02 ? 00:00:00 jfsCommit
>> root 89 2 0 Jul02 ? 00:00:00 jfsSync
>> root 90 2 0 Jul02 ? 00:00:00 xfs_mru_cache
>> root 91 2 0 Jul02 ? 00:00:02 xfslogd/0
>> root 92 2 0 Jul02 ? 00:00:00 xfslogd/1
>> root 93 2 0 Jul02 ? 00:00:00 xfslogd/2
>> root 94 2 0 Jul02 ? 00:00:00 xfslogd/3
>> root 95 2 0 Jul02 ? 00:00:00 xfslogd/4
>> root 96 2 0 Jul02 ? 00:00:00 xfslogd/5
>> root 97 2 0 Jul02 ? 00:00:00 xfslogd/6
>> root 98 2 0 Jul02 ? 00:00:00 xfslogd/7
>> root 99 2 0 Jul02 ? 00:00:35 xfsdatad/0
>> root 100 2 0 Jul02 ? 00:00:00 xfsdatad/1
>> root 101 2 0 Jul02 ? 00:00:00 xfsdatad/2
>> root 102 2 0 Jul02 ? 00:00:00 xfsdatad/3
>> root 103 2 0 Jul02 ? 00:00:00 xfsdatad/4
>> root 104 2 0 Jul02 ? 00:00:00 xfsdatad/5
>> root 105 2 0 Jul02 ? 00:00:00 xfsdatad/6
>> root 106 2 0 Jul02 ? 00:00:00 xfsdatad/7
>> root 107 2 0 Jul02 ? 00:00:00 xfsconvertd/0
>> root 108 2 0 Jul02 ? 00:00:00 xfsconvertd/1
>> root 109 2 0 Jul02 ? 00:00:00 xfsconvertd/2
>> root 110 2 0 Jul02 ? 00:00:00 xfsconvertd/3
>> root 111 2 0 Jul02 ? 00:00:00 xfsconvertd/4
>> root 112 2 0 Jul02 ? 00:00:00 xfsconvertd/5
>> root 113 2 0 Jul02 ? 00:00:00 xfsconvertd/6
>> root 114 2 0 Jul02 ? 00:00:00 xfsconvertd/7
>> root 115 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 116 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 117 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 118 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 119 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 120 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 121 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 122 2 0 Jul02 ? 00:00:00 glock_workqueue
>> root 123 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 124 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 125 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 126 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 127 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 128 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 129 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 130 2 0 Jul02 ? 00:00:00 delete_workqueu
>> root 131 2 0 Jul02 ? 00:00:00 kslowd000
>> root 132 2 0 Jul02 ? 00:00:00 kslowd001
>> root 133 2 0 Jul02 ? 00:00:00 crypto/0
>> root 134 2 0 Jul02 ? 00:00:00 crypto/1
>> root 135 2 0 Jul02 ? 00:00:00 crypto/2
>> root 136 2 0 Jul02 ? 00:00:00 crypto/3
>> root 137 2 0 Jul02 ? 00:00:00 crypto/4
>> root 138 2 0 Jul02 ? 00:00:00 crypto/5
>> root 139 2 0 Jul02 ? 00:00:00 crypto/6
>> root 140 2 0 Jul02 ? 00:00:00 crypto/7
>> root 143 2 0 Jul02 ? 00:00:00 net_accel/0
>> root 144 2 0 Jul02 ? 00:00:00 net_accel/1
>> root 145 2 0 Jul02 ? 00:00:00 net_accel/2
>> root 146 2 0 Jul02 ? 00:00:00 net_accel/3
>> root 147 2 0 Jul02 ? 00:00:00 net_accel/4
>> root 148 2 0 Jul02 ? 00:00:00 net_accel/5
>> root 149 2 0 Jul02 ? 00:00:00 net_accel/6
>> root 150 2 0 Jul02 ? 00:00:00 net_accel/7
>> root 151 2 0 Jul02 ? 00:00:00 sfc_netfront/0
>> root 152 2 0 Jul02 ? 00:00:00 sfc_netfront/1
>> root 153 2 0 Jul02 ? 00:00:00 sfc_netfront/2
>> root 154 2 0 Jul02 ? 00:00:00 sfc_netfront/3
>> root 155 2 0 Jul02 ? 00:00:00 sfc_netfront/4
>> root 156 2 0 Jul02 ? 00:00:00 sfc_netfront/5
>> root 157 2 0 Jul02 ? 00:00:00 sfc_netfront/6
>> root 158 2 0 Jul02 ? 00:00:00 sfc_netfront/7
>> root 159 2 0 Jul02 ? 00:00:00 kstriped
>> root 160 2 99 Jul02 ? 1184011132-12:04:39 kjournald
>> root 188 1 0 Jul02 ? 00:00:00 upstart-udev-bridge --daemon
>> root 190 1 0 Jul02 ? 00:00:00 udevd --daemon
>> root 243 190 0 Jul02 ? 00:00:00 udevd --daemon
>> root 244 190 0 Jul02 ? 00:00:00 udevd --daemon
>> root 349 1 0 Jul02 ? 00:00:00 dhclient3 -e IF_METRIC=100 -pf
>> /var/run/dhclient.eth0.pid -lf /var/lib/dhcp3/dhclient.eth0.leases eth0
>> root 408 2 99 Jul02 ? 1184011132-12:04:39 flush-8:1
>> root 514 2 99 Jul02 ? 1184011132-12:04:39 kjournald
>> syslog 521 1 0 Jul02 ? 00:00:04 rsyslogd -c4
>> root 528 1 0 Jul02 ? 00:00:01 /usr/sbin/sshd
>> 102 529 1 0 Jul02 ? 00:00:00 dbus-daemon --system --fork
>> avahi 557 1 0 Jul02 ? 00:00:00 avahi-daemon: running
>> http://ip-10-66-135-31.local
>> avahi 558 557 0 Jul02 ? 00:00:00 avahi-daemon: chroot helper
>> daemon 588 1 0 Jul02 ? 00:00:00 atd
>> root 589 1 0 Jul02 ? 00:00:02 cron
>> root 591 2 0 Jul05 ? 00:00:00 flush-8:112
>> root 673 1 0 Jul02 tty1 00:00:00 /sbin/getty -8 38400 tty1
>> root 2979 2 0 Jul02 ? 00:00:12 xfsbufd
>> root 2980 2 0 Jul02 ? 00:00:01 xfsaild
>> root 2981 2 99 Jul02 ? 1184011132-12:04:39 xfssyncd
>> root 2990 2 0 Jul02 ? 00:10:09 xfsbufd
>> root 2991 2 99 Jul02 ? 1184011132-12:04:39 xfsaild
>> root 2992 2 99 Jul02 ? 1184011132-12:04:39 xfssyncd
>> root 2997 2 0 Jul02 ? 00:00:10 xfsbufd
>> root 2998 2 0 Jul02 ? 00:00:01 xfsaild
>> root 2999 2 0 Jul02 ? 00:00:00 xfssyncd
>> root 3004 2 0 Jul02 ? 00:00:20 xfsbufd
>> root 3005 2 0 Jul02 ? 00:00:01 xfsaild
>> root 3006 2 99 Jul02 ? 1184011132-12:04:39 xfssyncd
>> root 3010 2 99 Jul02 ? 1184011132-12:04:39 xfsbufd
>> root 3011 2 99 Jul02 ? 1184011132-12:04:39 xfsaild
>> root 3012 2 0 Jul02 ? 00:00:00 xfssyncd
>> root 5105 1 99 Jul02 ? 1184011132-12:04:39 /usr/sbin/console-kit-daemon
>> --no-daemon
>> hadoop 5228 1 99 Jul02 ? 1184011132-12:04:39
>> /usr/lib/jvm/java-6-sun/bin/java -Dproc_datanode -Xmx1000m
>> -Dcom.sun.management.jmxremote -Dcom.sun.manage
>> hadoop 5337 1 0 Jul02 ? 00:00:27 /usr/lib/jvm/java-6-sun/bin/java
>> -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop/logs
>> -Dhadoop.log.file
>> nobody 5370 1 0 Jul02 ? 00:00:16 gmond start
>> root 7255 1 0 Jul02 ? 00:00:08 sendmail: MTA: accepting connections
>> hadoop 15940 5337 3 00:16 ? 00:00:20
>> /usr/lib/jvm/java-6-sun-1.6.0.22/jre/bin/java
>> -Djava.library.path=/usr/local/hadoop-0.20.2-cdh3u0/lib/native/Linux-am
>> hadoop 15956 5337 99 00:16 ? 00:13:45
>> /usr/lib/jvm/java-6-sun-1.6.0.22/jre/bin/java
>> -Djava.library.path=/usr/local/hadoop-0.20.2-cdh3u0/lib/native/Linux-am
>> root 16458 2 0 00:21 ? 00:00:00 flush-8:176
>> root 16459 2 0 00:21 ? 00:00:00 flush-8:192
>> root 16466 2 0 00:22 ? 00:00:00 flush-8:160
>> root 16472 528 0 00:23 ? 00:00:00 sshd: hadoop priv
>> root 16474 528 0 00:23 ? 00:00:00 sshd: hadoop priv
>> hadoop 17797 16472 0 00:26 ? 00:00:00 sshd: hadoop@pts/0
>> hadoop 17798 16474 0 00:26 ? 00:00:00 sshd: hadoop@pts/1
>> hadoop 17810 17798 0 00:26 pts/1 00:00:00 -bash
>> hadoop 17811 17797 0 00:26 pts/0 00:00:00 -bash
>> root 17873 2 0 00:26 ? 00:00:00 flush-8:16
>> hadoop 17907 5337 99 00:26 ? 00:00:01
>> /usr/lib/jvm/java-6-sun-1.6.0.22/jre/bin/java
>> -Djava.library.path=/usr/local/hadoop-0.20.2-cdh3u0/lib/native/Linux-am
>> hadoop 17933 17811 0 00:26 pts/0 00:00:00 ps -ef
>> hadoop 17938 17907 0 00:26 ? 00:00:00 ln <defunct>
>> hadoop 17939 5337 0 00:26 ? 00:00:00 java
>> hadoop@ip-10-66-135-31:~$ iostat
>> Linux 2.6.32-309-ec2 (ip-10-66-135-31) 07/06/2011 _x86_64_ (8 CPU)
>>
>> avg-cpu:   %user   %nice %system %iowait  %steal   %idle
>>             6.78    0.00    0.83    0.64    0.32   91.43
>>
>> Device:      tps  Blk_read/s  Blk_wrtn/s   Blk_read   Blk_wrtn
>> sda1        2.17        7.81       18.40    2506610    5903424
>> sdb        20.03      292.95     1302.75   94015994  418085704
>> sdj        15.17      953.59      326.60  306031633  104814604
>> sdk        15.24      947.01      339.31  303917473  108891811
>> sdl        14.27      872.23      330.64  279919466  106109534
>> sdm        14.84      930.28      321.98  298548801  103332735
>> sdh         0.59        0.15        7.83      48278    2513230
>>
>>
>>
>
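
A caveat on reading the iostat output quoted above: iostat run without an interval argument reports averages since boot, so the 0.64% iowait covers the whole 3 days of uptime and may understate what was happening during a stall. A rough way to see iowait over a short window is to sample the aggregate counters in /proc/stat twice and take the difference. The sketch below is illustrative only and assumes a Linux /proc filesystem; the function names and the one-second window are arbitrary choices, not anything from the thread.

```python
import os
import time


def cpu_times(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a list of ints.

    Field order on the "cpu" line:
    user nice system idle iowait irq softirq steal [guest guest_nice]
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [new - old for old, new in zip(before, after)]
    # Sum user..steal only; guest time is already counted inside user.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def sample(path="/proc/stat"):
    with open(path) as f:
        return cpu_times(f.read())


def main(window_seconds=1.0):
    if not os.path.exists("/proc/stat"):
        print("no /proc/stat here; this sketch is Linux-only")
        return
    before = sample()
    time.sleep(window_seconds)
    after = sample()
    print(f"iowait over {window_seconds:.0f}s: {iowait_percent(before, after):.1f}%")


if __name__ == "__main__":
    main()
```

Running `iostat` with an interval (for example `iostat 5`) gives comparable per-interval figures after the first report, along with the per-device rates.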
