hadoop-common-user mailing list archives

From john smith <js1987.sm...@gmail.com>
Subject Re: Datanodes going down frequently
Date Fri, 16 Sep 2011 17:03:57 GMT
Hi Aaron,

I haven't really run any MR jobs on my cluster so far; I've just been
pushing data into HDFS, so the network shouldn't be a problem.

Initially my HADOOP_HEAPSIZE was set to 2000MB while each machine has only
2GB of RAM. This resulted in datanodes going down randomly. I eventually
realized that the OS kept crashing and the machine stayed unresponsive until
I manually powered it back on.

So I reduced HADOOP_HEAPSIZE to 800MB and the cluster seems to be stable
again; the datanodes have stayed up for the past few hours. (I am not sure
yet, though; I need to run a few heavy jobs to check it thoroughly.)
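
For reference, the relevant line in my hadoop-env.sh now looks roughly like
this (just a sketch; the exact value depends on how much RAM needs to stay
free for the OS and the other daemons on each node):

    # hadoop-env.sh -- max heap (in MB) handed to each Hadoop daemon
    # On a 2GB node, keep this well below total RAM so the OS and the
    # kernel buffers still have headroom and the box doesn't swap.
    export HADOOP_HEAPSIZE=800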

Looks like my problem wasn't the Ethernet interface going down; it's
actually a full OS crash. I am not used to KVMs, so I'll have to read up on
them, attach one to the datanodes, and watch them closely in case they fail
again in the future.

What about your cluster? Are you running any "shuffle-intense" jobs like
JOINs or CROSS PRODUCTs?

Thanks

On Fri, Sep 16, 2011 at 10:16 PM, Aaron Baff <Aaron.Baff@telescope.tv> wrote:

> John,
>
> Are the machines simply unreachable? Or has the OS crashed? We've been
> having quite a few problems with our network mbufs filling up and not
> getting released, which causes a machine to eventually become unreachable
> via the network, although they are otherwise up and running fine. Can you
> attach a KVM to a machine when it becomes unreachable and take a look? Or
> add some monitoring to keep an eye on the network mbufs? Don't know if this
> is your problem as well or not.
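>
> Something rough like this might be enough to catch it (a sketch assuming
> Linux hosts; on BSD-style boxes "netstat -m" reports the mbuf pools
> directly, and the log path below is just an example):
>
>     # Log socket/buffer memory once a minute so there is some history
>     # to look at after a node drops off the network.
>     while true; do
>         { date; cat /proc/net/sockstat; echo; } >> /var/log/net-mem.log
>         sleep 60
>     done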
>
> --Aaron
> -----Original Message-----
> From: john smith [mailto:js1987.smith@gmail.com]
> Sent: Thursday, September 15, 2011 9:46 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Datanodes going down frequently
>
> Hi All,
>
> Thanks for your inputs,
>
> @Aaron: No, they aren't recovering. They lose network connectivity and
> don't get it back. I am unable to ssh to them, and I need to manually go
> and restart the networking.
>
> @harsh and Raj,
>
> One thing I noticed in my hadoop-env.sh is "export HADOOP_HEAPSIZE=2000".
> Isn't this strange? Allocating my whole RAM to the JVM? Should I
> reconsider this? Right now I am not running any MR jobs as such.
>
> I've started my cluster and put around 30 to 40GB of data in with a
> replication factor of 3, and that takes the machines down. Looks like a
> swapping issue... but how do I check whether I am actually swapping? Any
> help?
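>
> The only check I know of is watching vmstat for a while and looking at the
> si/so (swap-in/swap-out) columns, plus free -m for overall usage -- is
> something like this enough?
>
>     # Sample memory/swap activity every 5 seconds, 20 samples; non-zero
>     # si/so columns while the datanodes are loaded would mean swapping.
>     vmstat 5 20
>     free -m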
>
> Thanks
> jS
>
> On Fri, Sep 16, 2011 at 10:03 AM, Harsh J <harsh@cloudera.com> wrote:
>
> > I bet it's swapping. You may just be oversubscribing those machines
> > with your MR slots and heap per slot, or otherwise. It could also be
> > low heap given the number of blocks the datanode has got to report
> > (which would point to a small-files issue given your cluster size,
> > possibly, but that's a different discussion).
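> >
> > Rough back-of-the-envelope for a 2GB box, purely as an illustration (the
> > slot counts and child heap below are typical 0.20-era defaults, not
> > necessarily your actual config):
> >
> >     DataNode    -Xmx = HADOOP_HEAPSIZE         = 2000 MB
> >     TaskTracker -Xmx = HADOOP_HEAPSIZE         = 2000 MB
> >     2 map + 2 reduce slots x 200 MB child heap =  800 MB
> >     ----------------------------------------------------
> >     ~4.8 GB of potential heap on a 2 GB machine -> swapping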
> >
> > On Fri, Sep 16, 2011 at 3:36 AM, john smith <js1987.smith@gmail.com>
> > wrote:
> > > Hi all,
> > >
> > > I am running a 10 node cluster (1NN + 9DN, Ubuntu Server 10.04, 2GB RAM
> > > each) and I am facing a strange problem. My datanodes go down randomly
> > > and nothing shows up in the logs. They suddenly lose their network
> > > connectivity and the NN declares them dead. Has anyone faced this
> > > problem? Is it because of Hadoop or is it some problem with my
> > > infrastructure?
> > >
> > > The worst part of the problem is that I need to manually go to the
> > > remote machine and restart the networking. Can someone help me with
> > > this? Did anyone face a similar kind of problem?
> > >
> > > Btw: my Hadoop version is 0.20.2
> > >
> > > Thanks,
> > > jS
> > >
> >
> >
> >
> > --
> > Harsh J
> >
>
