From: "George P. Stathis" <gstathis@traackr.com>
Date: Tue, 17 Aug 2010 18:49:05 -0400
Subject: Re: High OS Load Numbers when idle
To: user@hbase.apache.org

Actually, there is nothing in %wa but a ton sitting in %id. This is
from the Master:

top - 18:30:24 up 5 days, 20:10,  1 user,  load average: 2.55, 1.99, 1.25
Tasks:  89 total,   1 running,  88 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni, 99.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.2%st
Mem:  17920228k total,  2795464k used, 15124764k free,   248428k buffers
Swap:        0k total,        0k used,        0k free,  1398388k cached
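
To put the load averages above on the per-CPU percentage scale mentioned for Ganglia in the original post, here is a minimal Python sketch. It assumes the 2 CPUs quoted for these instances; the helper name `load_per_cpu` is made up for illustration and is not part of any tool used here.

```python
import re

def load_per_cpu(top_header, n_cpus):
    """Return the 1, 5 and 15 minute load averages as a percentage of n_cpus."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header
    )
    if match is None:
        raise ValueError("no 'load average:' field found")
    # A load equal to the CPU count corresponds to 100%.
    return [round(float(value) / n_cpus * 100, 1) for value in match.groups()]

# First line of the top output above; n_cpus=2 is taken from the instance description.
header = "top - 18:30:24 up 5 days, 20:10,  1 user,  load average: 2.55, 1.99, 1.25"
print(load_per_cpu(header, n_cpus=2))  # approximately [127.5, 99.5, 62.5]
```

On that scale the 1 minute figure is above 100% even though the Cpu(s) line shows 99.8% idle.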

I have atop installed which is reporting the hadoop/hbase java daemons
as the most active processes (barely taking any CPU time though and
most of the time in sleep mode):

ATOP - domU-12-31-39-18-1 2010/08/17  18:31:46               10 seconds elapsed
PRC | sys   0.01s | user   0.00s | #proc     89 | #zombie    0 | #exit      0 |
CPU | sys      0% | user      0% | irq       0% | idle    200% | wait      0% |
cpu | sys      0% | user      0% | irq       0% | idle    100% | cpu000 w  0% |
CPL | avg1   2.55 | avg5    2.12 | avg15   1.35 | csw     2397 | intr    2034 |
MEM | tot   17.1G | free   14.4G | cache   1.3G | buff  242.6M | slab  193.1M |
SWP | tot    0.0M | free    0.0M |              | vmcom   1.6G | vmlim   8.5G |
NET | transport   | tcpi     330 | tcpo     169 | udpi     566 | udpo     147 |
NET | network     | ipi      896 | ipo      316 | ipfrw      0 | deliv    896 |
NET | eth0   ---- | pcki     777 | pcko     197 | si  248 Kbps | so   70 Kbps |
NET | lo     ---- | pcki     119 | pcko     119 | si    9 Kbps | so    9 Kbps |

  PID  CPU COMMAND-LINE                                                  1/1
17613   0% atop
17150   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor
16527   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
16839   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
16735   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.managem
17083   0% /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemor

Same with htop:

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
16527 ubuntu    20   0 2352M   98M 10336 S  0.0  0.6  0:42.05 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
16735 ubuntu    20   0 2403M 81544 10236 S  0.0  0.5  0:01.56 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -server -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir=/var/log/h
17083 ubuntu    20   0 4557M 45388 10912 S  0.0  0.3  0:00.65 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX:+Heap
    1 root      20   0 23684  1880  1272 S  0.0  0.0  0:00.23 /sbin/init
  587 root      20   0  247M  4088  2432 S  0.0  0.0 -596523h-14:-8 /usr/sbin/console-kit-daemon --no-daemon
 3336 root      20   0 49256  1092   540 S  0.0  0.0  0:00.36 /usr/sbin/sshd
16430 nobody    20   0 34408  3704  1060 S  0.0  0.0  0:00.01 gmond
17150 ubuntu    20   0 2519M  112M 11312 S  0.0  0.6 -596523h-14:-8 /usr/lib/jvm/java-6-sun/bin/java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -server -XX


So I'm a bit perplexed. Are there any hadoop / hbase specific tricks
that can reveal what these processes are doing?
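
One generic, not Hadoop-specific, angle: on Linux the load average counts tasks that are runnable (state R) or in uninterruptible sleep (state D), so listing the threads currently in those states can show what is feeding the number. Below is a minimal Python sketch under that assumption; the function name `tasks_counted_in_load` is hypothetical, and it only reads the standard `/proc/<pid>/task/<tid>/stat` files.

```python
import os

def tasks_counted_in_load(proc_root="/proc"):
    """List (pid, tid, state, name) for every thread in state R or D."""
    found = []
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as fh:
                    stat = fh.read()
            except OSError:
                continue  # the thread exited while we were scanning
            # Format is "tid (comm) state ..."; comm may contain spaces or
            # parentheses, so locate it via the first "(" and the last ")".
            close = stat.rindex(")")
            name = stat[stat.index("(") + 1:close]
            state = stat[close + 2]
            if state in ("R", "D"):
                found.append((int(pid), int(tid), state, name))
    return found

if __name__ == "__main__" and os.path.isdir("/proc"):
    # The scanning script itself normally shows up as one R entry.
    for pid, tid, state, name in tasks_counted_in_load():
        print(pid, tid, state, name)
```

Since this is a single snapshot, it probably needs to be run repeatedly to catch short-lived R or D states. For the Java daemons themselves, `jstack <pid>` from the JDK prints a thread dump that may show what each thread is waiting on.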

-GS


On Tue, Aug 17, 2010 at 6:14 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
> It's not normal, but then again I don't have access to your machines
> so I can only speculate.
>
> Does "top" show you which process is in %wa? If so and it's a java
> process, can you figure what's going on in there?
>
> J-D
>
> On Tue, Aug 17, 2010 at 11:03 AM, George Stathis <gstathis@gmail.com> wrote:
> > Hello,
> >
> > We have just set up a new cluster on EC2 using Hadoop 0.20.2 and HBase
> > 0.20.3. Our small setup as of right now consists of one master and four
> > slaves with a replication factor of 2:
> >
> > Master: xLarge instance with 2 CPUs and 17.5 GB RAM - runs 1 namenode, 1
> > secondarynamenode, 1 jobtracker, 1 hbasemaster, 1 zookeeper (uses its own
> > dedicated EBS drive)
> > Slaves: xLarge instance with 2 CPUs and 17.5 GB RAM each - run 1 datanode, 1
> > tasktracker, 1 regionserver
> >
> > We have also installed Ganglia to monitor the cluster stats as we are about
> > to run some performance tests but, right out of the box, we are noticing
> > high system loads (especially on the master node) without any activity
> > happening on the cluster. Of course, the CPUs are not being utilized at all,
> > but Ganglia is reporting almost all nodes in the red as the 1, 5 and 15
> > minute load averages are all above 100% most of the time (i.e. there are more
> > than two processes at a time competing for the 2 CPUs' time).
> >
> > Question1: is this normal?
> > Question2: does it matter since each process barely uses any of the CPU
> > time?
> >
> > Thank you in advance and pardon the noob questions.
> >
> > -GS
> >