From: Leonardo Gamas <leogamas@jusbrasil.com.br>
To: user@hbase.apache.org
Date: Sat, 21 Jan 2012 11:04:22 -0200
Subject: Re: RegionServer dying every two or three days

Thanks Matt for this insightful article, I will run my cluster with
c1.xlarge to test its performance. But I'm concerned about this machine
because of the amount of RAM available, only 7GB. How many map/reduce
slots do you configure? And how much heap for HBase? How many regions per
RegionServer could my cluster support?

2012/1/20 Matt Corgan

> I run c1.xlarge servers and have found them very stable.  I see 100
> Mbit/s sustained bi-directional network throughput (200 Mbit/s total),
> sometimes up to 150 * 2 Mbit/s.
>
> Here's a pretty thorough examination of the underlying hardware:
>
> http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
>
> *High-CPU instances*
>
> The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket
> because we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge
> instance almost takes up the whole physical machine. However, we
> frequently observe steal cycle on a c1.xlarge instance ranging from 0%
> to 25% with an average of about 10%. The amount of steal cycle is not
> enough to host another smaller VM, i.e., a c1.medium. Maybe those steal
> cycles are used to run Amazon's software firewall (security group).
> On Passmark CPU mark, a c1.xlarge machine achieves 7,962.6, actually
> higher than an average dual-socket E5410 system is able to achieve
> (average is 6,903).
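>
> If you want to watch what your own instances are getting, sysstat works
> (a rough sketch, untested on these exact instances; assumes the sysstat
> package is installed):
>
>   $ sar -n DEV 5    # per-interface rx/tx KB/s, sampled every 5 seconds
>   $ sar -u 5        # the %steal column is the hypervisor steal mentioned above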
>
> On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas wrote:
>
> > Thanks Neil for sharing your experience with AWS! Could you tell what
> > instance type you are using?
> > We are using m1.xlarge, which has 4 virtual cores, but I normally see
> > recommendations for machines with 8 cores like c1.xlarge, m2.4xlarge,
> > etc. In principle these 8-core machines don't suffer as much from I/O
> > problems, since they don't share the physical server. Is there any
> > information from Amazon or another source that confirms this, or is it
> > based on empirical analysis?
> >
> > 2012/1/19 Neil Yalowitz
> >
> > > We have experienced many problems with our cluster on EC2.  The blunt
> > > solution was to increase the Zookeeper timeout to 5 minutes or even
> > > more.
> > >
> > > Even with a long timeout, however, it's not uncommon for us to see an
> > > EC2 instance become unresponsive to pings and SSH several times
> > > during a week.  It's been a very bad environment for clusters.
> > >
> > > Neil
> > >
> > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas wrote:
> > >
> > > > Hi Guys,
> > > >
> > > > I have tested the parameters provided by Sandy, and
> > > > -XX:+UseParallelOldGC solved the GC problems; thanks for the help,
> > > > Sandy.
> > > > I'm still experiencing some difficulties: the RegionServer
> > > > continues to shut down, but now it seems related to I/O. It starts
> > > > to time out many connections, new connections to/from the machine
> > > > time out too, and finally the RegionServer dies because of a
> > > > YouAreDeadException. I will collect more data, but I think it's an
> > > > issue inherent to the Amazon/virtualized environment.
> > > >
> > > > Thanks for the great help provided so far.
> > > >
> > > > 2012/1/5 Leonardo Gamas
> > > >
> > > > > I don't think so; if Amazon stopped the machine it would cause a
> > > > > stop of minutes, not seconds, and the DataNode, TaskTracker and
> > > > > Zookeeper continue to work normally.
> > > > > But it could be related to the shared nature of the Amazon
> > > > > environment: maybe some spike in I/O caused by another
> > > > > virtualized server on the same physical machine.
> > > > >
> > > > > This is the instance type I'm using:
> > > > >
> > > > > *Extra Large Instance*
> > > > >
> > > > > 15 GB memory
> > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > > 1,690 GB instance storage
> > > > > 64-bit platform
> > > > > I/O Performance: High
> > > > > API name: m1.xlarge
> > > > >
> > > > > I was not expecting to suffer from these problems, or at least
> > > > > not this much.
> > > > >
> > > > > 2012/1/5 Sandy Pratt
> > > > >
> > > > >> You think it's an Amazon problem maybe?  Like they paused or
> > > > >> migrated your virtual machine, and it just happens to be during
> > > > >> GC, leaving us to think the GC ran long when it didn't?  I don't
> > > > >> have a lot of experience with Amazon so I don't know if that
> > > > >> sort of thing is common.
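> > > > >>
> > > > >> One crude way to catch a pause of the whole VM in the act (rough
> > > > >> sketch, untested): run a loop outside the JVM that sleeps 1s and
> > > > >> logs whenever the wall clock jumps by much more than that:
> > > > >>
> > > > >> $ while true; do t0=$(date +%s); sleep 1; t1=$(date +%s); \
> > > > >>     [ $((t1-t0)) -gt 5 ] && echo "paused ~$((t1-t0))s at $(date)"; done
> > > > >>
> > > > >> If that fires at the same moment as a "long GC", it's the
> > > > >> machine, not the collector.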
> > > > >>
> > > > >> > -----Original Message-----
> > > > >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > >> > To: user@hbase.apache.org
> > > > >> > Subject: Re: RegionServer dying every two or three days
> > > > >> >
> > > > >> > I checked the CPU Utilization graphics provided by Amazon
> > > > >> > (it's not accurate, since the sample time is about 5 minutes)
> > > > >> > and don't see any abnormality. I will set up TSDB with Nagios
> > > > >> > to have a more reliable source of performance data.
> > > > >> >
> > > > >> > The machines don't have swap space; if I run
> > > > >> >
> > > > >> > $ swapon -s
> > > > >> >
> > > > >> > to display a swap usage summary, it returns an empty list.
> > > > >> >
> > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my
> > > > >> > next tests.
> > > > >> >
> > > > >> > I hadn't paid much attention to the value of the new size
> > > > >> > param.
> > > > >> >
> > > > >> > Thanks again for the help!!
> > > > >> >
> > > > >> > 2012/1/5 Sandy Pratt
> > > > >> >
> > > > >> > > That size heap doesn't seem like it should cause a 36 second
> > > > >> > > GC (a minor GC even, if I remember your logs correctly), so
> > > > >> > > I tend to think that other things are probably going on.
> > > > >> > >
> > > > >> > > This line here:
> > > > >> > >
> > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K),
> > > > >> > > 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs]
> > > > >> > > [Times: user=0.05 sys=0.01, real=36.96 secs]
> > > > >> > >
> > > > >> > > is really mysterious to me.  It seems to indicate that the
> > > > >> > > process was blocked for almost 37 seconds during a minor
> > > > >> > > collection.  Note the CPU times are very low but the wall
> > > > >> > > time is very high.  If it was actually doing GC work, I'd
> > > > >> > > expect to see user time higher than real time, as it is in
> > > > >> > > other parallel collections (see your log snippet).  Were you
> > > > >> > > really so CPU starved that it took 37 seconds to get in 50ms
> > > > >> > > of work?  I can't make sense of that.  I'm trying to think
> > > > >> > > of something that would block you for that long while all
> > > > >> > > your threads are stopped for GC, other than being in swap,
> > > > >> > > but I can't come up with anything.  You're certain you're
> > > > >> > > not in swap?
> > > > >> > >
> > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis
> > > > >> > > -XX:+AggressiveOpts while you troubleshoot?
> > > > >> > >
> > > > >> > > Why is your new size so small?  This generally means that
> > > > >> > > relatively more objects are being tenured than would be with
> > > > >> > > a larger new size.  This could make collections of the old
> > > > >> > > gen worse (GC time is said to be proportional to the number
> > > > >> > > of live objects in the generation, and CMS does indeed cause
> > > > >> > > STW pauses).  A typical new to tenured ratio might be 1:3.
> > > > >> > > Were the new gen GCs taking too long?  This is probably
> > > > >> > > orthogonal to your immediate issue, though.
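> > > > >> > >
> > > > >> > > (To see how common those outliers are, something like this
> > > > >> > > against your -Xloggc file should work; rough sketch,
> > > > >> > > untested:
> > > > >> > >
> > > > >> > > $ grep -o 'real=[0-9.]*' hbase-regionserver-gc.log | sort -t= -k2 -rn | head
> > > > >> > >
> > > > >> > > Anything with a big real but tiny user/sys is a suspect
> > > > >> > > pause outside the JVM.)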
> > > > >> > >
> > > > >> > > -----Original Message-----
> > > > >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > >> > > To: user@hbase.apache.org
> > > > >> > > Subject: Re: RegionServer dying every two or three days
> > > > >> > >
> > > > >> > > St.Ack,
> > > > >> > >
> > > > >> > > I haven't made any attempt at GC tuning yet.
> > > > >> > > I will read the perf section as suggested.
> > > > >> > > I'm currently using Nagios + JMX to monitor the cluster, but
> > > > >> > > it's used for alerting only; the perfdata is not being
> > > > >> > > stored, so it's kind of useless right now. I was thinking of
> > > > >> > > using TSDB to store it. Any known case of integration?
> > > > >> > > ---
> > > > >> > >
> > > > >> > > Sandy,
> > > > >> > >
> > > > >> > > Yes, my timeout is 30 seconds:
> > > > >> > >
> > > > >> > > <property>
> > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > >> > >   <value>30000</value>
> > > > >> > > </property>
> > > > >> > >
> > > > >> > > For our application that is a bearable time to wait in case
> > > > >> > > a RegionServer goes offline.
> > > > >> > >
> > > > >> > > My heap is 4GB and my JVM params are:
> > > > >> > >
> > > > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > >> > >
> > > > >> > > I will try the -XX:+UseParallelOldGC param and post my
> > > > >> > > feedback here.
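> > > > >> > >
> > > > >> > > Something like this in hbase-env.sh is what I plan to test
> > > > >> > > (a rough sketch, untested yet; as I understand it,
> > > > >> > > -XX:+UseParallelOldGC pairs with the throughput collector,
> > > > >> > > so I drop the ParNew/CMS flags here):
> > > > >> > >
> > > > >> > > export HBASE_OPTS="-Xmx4096m -server -XX:+UseParallelOldGC \
> > > > >> > >   -XX:NewSize=128m -XX:MaxNewSize=128m \
> > > > >> > >   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
> > > > >> > >   -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log"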
> > > > >> > > ---
> > > > >> > >
> > > > >> > > Ramkrishna,
> > > > >> > >
> > > > >> > > Seems the GC is the root of all evil in this case.
> > > > >> > > ----
> > > > >> > >
> > > > >> > > Thank you all for the answers. I will try out the valuable
> > > > >> > > advice given here and post my results.
> > > > >> > >
> > > > >> > > Leo Gamas.
> > > > >> > >
> > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <ramkrishna.vasudevan@huawei.com>
> > > > >> > >
> > > > >> > > > Recently we faced a similar problem and it was due to GC
> > > > >> > > > config.  Pls check your GC.
> > > > >> > > >
> > > > >> > > > Regards
> > > > >> > > > Ram
> > > > >> > > >
> > > > >> > > > -----Original Message-----
> > > > >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > >> > > > To: user@hbase.apache.org
> > > > >> > > > Subject: Re: RegionServer dying every two or three days
> > > > >> > > >
> > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas wrote:
> > > > >> > > > > The third line took 36.96 seconds to execute; can this
> > > > >> > > > > be causing this problem?
> > > > >> > > >
> > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > >> > > >
> > > > >> > > > > Reading the code a little, it seems that, even if it's
> > > > >> > > > > disabled, if all files are targeted in a compaction,
> > > > >> > > > > it's considered a major compaction. Is that right?
> > > > >> > > >
> > > > >> > > > That is right.  They get 'upgraded' from minor to major.
> > > > >> > > >
> > > > >> > > > This should be fine though.  What you are avoiding by
> > > > >> > > > setting major compactions to 0 is all regions being major
> > > > >> > > > compacted on a period, a heavyweight effective rewrite of
> > > > >> > > > all your data (unless already major compacted).  It looks
> > > > >> > > > like you have this disabled, which is good until you've
> > > > >> > > > wrestled your cluster into submission.
> > > > >> > > >
> > > > >> > > > > The machines don't have swap, so the swappiness
> > > > >> > > > > parameter doesn't seem to apply here. Any other
> > > > >> > > > > suggestion?
> > > > >> > > >
> > > > >> > > > See the perf section of the hbase manual.  It has our
> > > > >> > > > current list.
> > > > >> > > >
> > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > > >> > > >
> > > > >> > > > St.Ack
> > > > >> > > >
> > > > >> > > > > Thanks.
> > > > >> > > > >
> > > > >> > > > > 2012/1/4 Leonardo Gamas
> > > > >> > > > >
> > > > >> > > > >> I will investigate this, thanks for the response.
> > > > >> > > > >>
> > > > >> > > > >> 2012/1/3 Sandy Pratt
> > > > >> > > > >>
> > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client
> > > > >> > > > >>> session timed out, have not heard from server in
> > > > >> > > > >>> 61103ms for sessionid 0x23462a4cf93a8fc, closing
> > > > >> > > > >>> socket connection and attempting reconnect
> > > > >> > > > >>>
> > > > >> > > > >>> It looks like the process has been unresponsive for
> > > > >> > > > >>> some time, so ZK has terminated the session.  Did you
> > > > >> > > > >>> experience a long GC pause right before this?  If you
> > > > >> > > > >>> don't have GC logging enabled for the RS, you can
> > > > >> > > > >>> sometimes tell by noticing a gap in the timestamps of
> > > > >> > > > >>> the log statements leading up to the crash.
> > > > >> > > > >>>
> > > > >> > > > >>> If it turns out to be GC, you might want to look at
> > > > >> > > > >>> your kernel swappiness setting (set it to 0) and your
> > > > >> > > > >>> JVM params.
> > > > >> > > > >>>
> > > > >> > > > >>> Sandy
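> > > > >> > > > >>>
> > > > >> > > > >>> (To check swappiness and turn it off, a quick sketch
> > > > >> > > > >>> for a generic Linux box; persist the setting in
> > > > >> > > > >>> /etc/sysctl.conf if it helps:
> > > > >> > > > >>>
> > > > >> > > > >>> $ cat /proc/sys/vm/swappiness
> > > > >> > > > >>> $ sudo sysctl vm.swappiness=0
> > > > >> > > > >>> )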
> > > > >> > > > >>>
> > > > >> > > > >>> > -----Original Message-----
> > > > >> > > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > >> > > > >>> > To: user@hbase.apache.org
> > > > >> > > > >>> > Subject: RegionServer dying every two or three days
> > > > >> > > > >>> >
> > > > >> > > > >>> > Hi,
> > > > >> > > > >>> >
> > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines
> > > > >> > > > >>> > (1 Master + 3 Slaves), running on Amazon EC2. The
> > > > >> > > > >>> > master is a High-Memory Extra Large Instance
> > > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> > > > >> > > > >>> > Zookeeper. The slaves are Extra Large Instances
> > > > >> > > > >>> > (m1.xlarge) running DataNode, TaskTracker,
> > > > >> > > > >>> > RegionServer and Zookeeper.
> > > > >> > > > >>> >
> > > > >> > > > >>> > From time to time, every two or three days, one of
> > > > >> > > > >>> > the RegionServer processes goes down, but the other
> > > > >> > > > >>> > processes (DataNode, TaskTracker, Zookeeper)
> > > > >> > > > >>> > continue normally.
> > > > >> > > > >>> >
> > > > >> > > > >>> > Reading the logs:
> > > > >> > > > >>> >
> > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > And the Handlers start to fail:
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > >> > > > >>> >
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> > > > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > >> > > > >>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > >> > > > >>> >   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > > >> > > > >>> >   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > >> > > > >>> >   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > > > >> > > > >>> >   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > > > >> > > > >>> >   at java.lang.Thread.run(Thread.java:662)
> > > > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> > > > >> > > > >>> >   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > > > >> > > > >>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >> > > > >>> >   at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> > > > >> > > > >>> >
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > > > >> > > > >>> >   at $Proxy6.regionServerReport(Unknown Source)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > > > >> > > > >>> >   ... 2 more
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > Then I restart the RegionServer and everything is
> > > > >> > > > >>> > back to normal.
> > > > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker
> > > > >> > > > >>> > logs, I don't see any abnormality in the same time
> > > > >> > > > >>> > window.
> > > > >> > > > >>> > I think it was caused by the loss of the connection
> > > > >> > > > >>> > to Zookeeper. Is it advisable to run Zookeeper on
> > > > >> > > > >>> > the same machines? If the RegionServer loses its
> > > > >> > > > >>> > connection to Zookeeper, is there a way (a
> > > > >> > > > >>> > configuration perhaps) to rejoin the cluster, and
> > > > >> > > > >>> > not just die?
> > > > >> > > > >>> >
> > > > >> > > > >>> > Any idea what is causing this? Or how to prevent it
> > > > >> > > > >>> > from happening?
> > > > >> > > > >>> >
> > > > >> > > > >>> > Any help is appreciated.
> > > > >> > > > >>> >
> > > > >> > > > >>> > Best Regards,
> > > > >> > > > >>> >
> > > > >> > > > >>> > --
> > > > >> > > > >>> > *Leonardo Gamas*
> > > > >> > > > >>> > Software Engineer
> > > > >> > > > >>> > +557134943514
> > > > >> > > > >>> > +557581347440
> > > > >> > > > >>> > leogamas@jusbrasil.com.br
> > > > >> > > > >>> > www.jusbrasil.com.br
> > > > >> > > > >>
> > > > >> > > > >> --
> > > > >> > > > >> *Leonardo Gamas*
> > > > >> > > > >> Software Engineer/Chaos Monkey Engineer
> > > > >> > > > >> T (71) 3494-3514  C (75) 8134-7440
> > > > >> > > > >> leogamas@jusbrasil.com.br
> > > > >> > > > >> www.jusbrasil.com.br
> > > > >> > > > >
> > > > >> > > > > --
> > > > >> > > > > *Leonardo Gamas*
> > > > >> > > > > Software Engineer/Chaos Monkey Engineer
> > > > >> > > > > T (71) 3494-3514  C (75) 8134-7440
> > > > >> > > > > leogamas@jusbrasil.com.br
> > > > >> > > > > www.jusbrasil.com.br
> > > > >> > >
> > > > >> > > --
> > > > >> > > *Leonardo Gamas*
> > > > >> > > Software Engineer
> > > > >> > > +557134943514
> > > > >> > > +557581347440
> > > > >> > > leogamas@jusbrasil.com.br
> > > > >> > > www.jusbrasil.com.br
> > > > >> >
> > > > >> > --
> > > > >> > *Leonardo Gamas*
> > > > >> > Software Engineer
> > > > >> > T +55 (71) 3494-3514
> > > > >> > C +55 (75) 8134-7440
> > > > >> > leogamas@jusbrasil.com.br
> > > > >> > www.jusbrasil.com.br
> > > > >
> > > > > --
> > > > > *Leonardo Gamas*
> > > > > Software Engineer
> > > > > T +55 (71) 3494-3514
> > > > > C +55 (75) 8134-7440
> > > > > leogamas@jusbrasil.com.br
> > > > > www.jusbrasil.com.br
> > > >
> > > > --
> > > > *Leonardo Gamas*
> > > > Software Engineer
> > > > T +55 (71) 3494-3514
> > > > C +55 (75) 8134-7440
> > > > leogamas@jusbrasil.com.br
> > > > www.jusbrasil.com.br
> >
> > --
> > *Leonardo Gamas*
> > Software Engineer
> > T +55 (71) 3494-3514
> > C +55 (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br

--
*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br