hadoop-common-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: High load on datanode startup
Date Fri, 11 May 2012 09:32:30 GMT
On Fri, May 11, 2012 at 2:29 AM, Darrell Taylor
<darrell.taylor@gmail.com> wrote:
>
> What I saw on the machine was thousands of recursive processes in ps of the
> form 'bash /usr/bin/hbase classpath...'. Stopping everything didn't clean
> the processes up, so I had to kill them manually with some grep/xargs foo.
> Once this was all cleaned up and the hadoop-env.sh file removed, the nodes
> seem to be happy again.
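
For reference, the manual cleanup described above might look roughly like the
following. This is only a sketch: the helper name runaway_pids is made up for
illustration, and it assumes the stuck processes show up in ps with exactly
the command line quoted above.

```shell
# Hypothetical helper: print the PIDs of leftover 'bash /usr/bin/hbase classpath'
# processes. Expects "PID COMMAND ARGS..." lines on stdin, which is what
# `ps -eo pid=,args=` produces.
runaway_pids() {
  awk '$2 == "bash" && $3 == "/usr/bin/hbase" && $4 == "classpath" { print $1 }'
}
```

It could be used as `ps -eo pid=,args= | runaway_pids | xargs -r kill`, where
`-r` (GNU xargs) skips the kill when nothing matches. Reviewing the PID list
before piping it to kill is the safer order.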

Ah -- my guess is that "hbase classpath" now tries to include the Hadoop
dependencies by calling "hadoop classpath". But "hadoop classpath" was
recursing right back into "hbase classpath" because of that setting in
hadoop-env.sh. Basically you made a fork bomb, which explains the shape
of the graph in Ganglia perfectly.
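
One way to keep such a setting without the loop is a re-entry guard. The
sketch below is only an illustration under assumptions: the exact line in
hadoop-env.sh is not shown in this message, so it assumes the file filled
HADOOP_CLASSPATH from "hbase classpath". The function name and the
HBASE_CP_GUARD variable are invented here and are not Hadoop or HBase settings.

```shell
# Hypothetical guard for hadoop-env.sh: call `hbase classpath` at most once
# per process tree, so hadoop -> hbase -> hadoop cannot keep recursing.
add_hbase_classpath() {
  # HBASE_CP_GUARD is an illustrative name. Because it is exported, a nested
  # hadoop invocation started by hbase inherits it and returns immediately.
  if [ -n "${HBASE_CP_GUARD:-}" ]; then
    return 0
  fi
  export HBASE_CP_GUARD=1
  # Only extend the classpath when an hbase command is actually available.
  if command -v hbase >/dev/null 2>&1; then
    HADOOP_CLASSPATH="${HADOOP_CLASSPATH:+$HADOOP_CLASSPATH:}$(hbase classpath)"
    export HADOOP_CLASSPATH
  fi
}
```

Calling add_hbase_classpath from hadoop-env.sh would then run
"hbase classpath" only in the outermost invocation; the nested
"hadoop classpath" sees the guard and skips the call.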

-Todd

>
> Darrell.
>
>
>>
>> Raj
>>
>>
>>
>> >________________________________
>> > From: Darrell Taylor <darrell.taylor@gmail.com>
>> >To: common-user@hadoop.apache.org
>> >Cc: Raj Vishwanathan <rajvish@yahoo.com>
>> >Sent: Thursday, May 10, 2012 3:57 AM
>> >Subject: Re: High load on datanode startup
>> >
>> >On Thu, May 10, 2012 at 9:33 AM, Todd Lipcon <todd@cloudera.com> wrote:
>> >
>> >> That's real weird..
>> >>
>> >> If you can reproduce this after a reboot, I'd recommend letting the DN
>> >> run for a minute, and then capturing a "jstack <pid of dn>" as well as
>> >> the output of "top -H -p <pid of dn> -b -n 5" and sending it to the list.
>> >
>> >
>> >What I did after the reboot this morning was to move my dn, nn, and
>> >mapred directories out of the way, create a new one, format it, and
>> >restart the node. It's now happy.
>> >
>> >I'll try moving the directories back later and do the jstack as you
>> >suggest.
>> >
>> >
>> >>
>> >> What JVM/JDK are you using? What OS version?
>> >>
>> >
>> >root@pl446:/# dpkg --get-selections | grep java
>> >java-common                                     install
>> >libjaxp1.3-java                                 install
>> >libjaxp1.3-java-gcj                             install
>> >libmysql-java                                   install
>> >libxerces2-java                                 install
>> >libxerces2-java-gcj                             install
>> >sun-java6-bin                                   install
>> >sun-java6-javadb                                install
>> >sun-java6-jdk                                   install
>> >sun-java6-jre                                   install
>> >
>> >root@pl446:/# java -version
>> >java version "1.6.0_26"
>> >Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
>> >Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>> >
>> >root@pl446:/# cat /etc/issue
>> >Debian GNU/Linux 6.0 \n \l
>> >
>> >
>> >
>> >>
>> >> -Todd
>> >>
>> >>
>> >> On Wed, May 9, 2012 at 11:57 PM, Darrell Taylor
>> >> <darrell.taylor@gmail.com> wrote:
>> >> > On Wed, May 9, 2012 at 10:52 PM, Raj Vishwanathan <rajvish@yahoo.com>
>> >> > wrote:
>> >> >
>> >> >> The picture is either too small or too pixelated for my eyes :-)
>> >> >>
>> >> >
>> >> > There should be a zoom option in the top right of the page that allows
>> >> > you to view it full size.
>> >> >
>> >> >
>> >> >>
>> >> >> Can you log in to the box and send the output of top? If the system is
>> >> >> unresponsive, it has to be something more than an unbalanced hdfs
>> >> >> cluster, methinks.
>> >> >>
>> >> >
>> >> > Sorry, I'm unable to log in to the box, it's completely unresponsive.
>> >> >
>> >> >
>> >> >>
>> >> >> Raj
>> >> >>
>> >> >>
>> >> >>
>> >> >> >________________________________
>> >> >> > From: Darrell Taylor <darrell.taylor@gmail.com>
>> >> >> >To: common-user@hadoop.apache.org; Raj Vishwanathan <rajvish@yahoo.com>
>> >> >> >Sent: Wednesday, May 9, 2012 2:40 PM
>> >> >> >Subject: Re: High load on datanode startup
>> >> >> >
>> >> >> >On Wed, May 9, 2012 at 10:23 PM, Raj Vishwanathan <rajvish@yahoo.com>
>> >> >> >wrote:
>> >> >> >
>> >> >> >> When you say 'load', what do you mean? CPU load or something else?
>> >> >> >>
>> >> >> >
>> >> >> >I mean in the unix sense of load average, i.e. top would show a load of
>> >> >> >(currently) 376.
>> >> >> >
>> >> >> >Looking at Ganglia stats for the box it's not CPU load as such, the
>> >> >> >graphs show actual CPU usage as 30%, but the number of running processes
>> >> >> >is simply growing in a linear manner - screen shot of ganglia page here:
>> >> >> >
>> >> >> >https://picasaweb.google.com/lh/photo/Q0uFSzyLiriDuDnvyRUikXVR0iWwMibMfH0upnTwi28?feat=directlink
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> Raj
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> >________________________________
>> >> >> >> > From: Darrell Taylor <darrell.taylor@gmail.com>
>> >> >> >> >To: common-user@hadoop.apache.org
>> >> >> >> >Sent: Wednesday, May 9, 2012 9:52 AM
>> >> >> >> >Subject: High load on datanode startup
>> >> >> >> >
>> >> >> >> >Hi,
>> >> >> >> >
>> >> >> >> >I wonder if someone could give some pointers with a problem I'm having?
>> >> >> >> >
>> >> >> >> >I have a 7 machine cluster setup for testing and we have been pouring
>> >> >> >> >data into it for a week without issue, have learnt several things along
>> >> >> >> >the way and solved all the problems up to now by searching online, but
>> >> >> >> >now I'm stuck.  One of the data nodes decided to have a load of 70+ this
>> >> >> >> >morning. Stopping datanode and tasktracker brought it back to normal,
>> >> >> >> >but every time I start the datanode again the load shoots through the
>> >> >> >> >roof, and all I get in the logs is:
>> >> >> >> >
>> >> >> >> >STARTUP_MSG: Starting DataNode
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >STARTUP_MSG:   host = pl464/10.20.16.64
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >STARTUP_MSG:   args = []
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >STARTUP_MSG:   version = 0.20.2-cdh3u3
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >STARTUP_MSG:   build =
>> >> >> >> >file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~squeeze
>> >> >> >> >-************************************************************/
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >2012-05-09 16:12:05,925 INFO
>> >> >> >> >org.apache.hadoop.security.UserGroupInformation: JAAS Configuration
>> >> >> >> >already set up for Hadoop, not re-installing.
>> >> >> >> >
>> >> >> >> >2012-05-09 16:12:06,139 INFO
>> >> >> >> >org.apache.hadoop.security.UserGroupInformation: JAAS Configuration
>> >> >> >> >already set up for Hadoop, not re-installing.
>> >> >> >> >
>> >> >> >> >Nothing else.
>> >> >> >> >
>> >> >> >> >The load seems to max out only 1 of the CPUs, but the machine becomes
>> >> >> >> >*very* unresponsive.
>> >> >> >> >
>> >> >> >> >Anybody got any pointers of things I can try?
>> >> >> >> >
>> >> >> >> >Thanks
>> >> >> >> >Darrell.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >>
>> >>
>> >>
>> >>
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
>> >>
>> >
>> >
>> >
>>



-- 
Todd Lipcon
Software Engineer, Cloudera
