From: Leonardo Gamas <leogamas@jusbrasil.com.br>
To: user@hbase.apache.org
Date: Sat, 21 Jan 2012 11:04:22 -0200
Subject: Re: RegionServer dying every two or three days

Thanks Matt for this insightful article, I will run my cluster with
c1.xlarge to test its performance. But I'm concerned about this machine
because of the amount of RAM available, only 7GB. How many map/reduce
slots do you configure? And how much heap for HBase? How many regions per
RegionServer could my cluster support?

2012/1/20 Matt Corgan

> I run c1.xlarge servers and have found them very stable.  I see 100
> Mbit/s sustained bi-directional network throughput (200 Mbit/s total),
> sometimes up to 150 * 2 Mbit/s.
>
> Here's a pretty thorough examination of the underlying hardware:
>
> http://huanliu.wordpress.com/2010/06/14/amazons-physical-hardware-and-ec2-compute-unit/
>
> *High-CPU instances*
>
> The high-CPU instances (c1.medium, c1.xlarge) run on systems with
> dual-socket Intel Xeon E5410 2.33GHz processors. It is dual-socket
> because we see APIC IDs 0 to 7, and E5410 only has 4 cores. A c1.xlarge
> instance almost takes up the whole physical machine. However, we
> frequently observe steal cycle on a c1.xlarge instance ranging from 0%
> to 25% with an average of about 10%. The amount of steal cycle is not
> enough to host another smaller VM, i.e., a c1.medium. Maybe those steal
> cycles are used to run Amazon's software firewall (security group).
> On Passmark CPU mark, a c1.xlarge machine achieves 7,962.6, actually
> higher than an average dual-socket E5410 system is able to achieve
> (average is 6,903).
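>
> If you want to watch what your own instances are getting, sysstat works
> (a rough sketch, untested on these exact instances; assumes the sysstat
> package is installed):
>
>   $ sar -n DEV 5    # per-interface rx/tx KB/s, sampled every 5 seconds
>   $ sar -u 5        # the %steal column is the hypervisor steal mentioned above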
>
> On Fri, Jan 20, 2012 at 8:03 AM, Leonardo Gamas wrote:
>
> > Thanks Neil for sharing your experience with AWS! Could you tell what
> > instance type you are using?
> > We are using m1.xlarge, which has 4 virtual cores, but I normally see
> > recommendations for machines with 8 cores like c1.xlarge, m2.4xlarge,
> > etc. In principle these 8-core machines don't suffer as much from I/O
> > problems, since they don't share the physical server. Is there any
> > information from Amazon or another source that confirms this, or is it
> > based on empirical analysis?
> >
> > 2012/1/19 Neil Yalowitz
> >
> > > We have experienced many problems with our cluster on EC2.  The blunt
> > > solution was to increase the Zookeeper timeout to 5 minutes or even
> > > more.
> > >
> > > Even with a long timeout, however, it's not uncommon for us to see an
> > > EC2 instance become unresponsive to pings and SSH several times
> > > during a week.  It's been a very bad environment for clusters.
> > >
> > > Neil
> > >
> > > On Thu, Jan 19, 2012 at 11:49 AM, Leonardo Gamas wrote:
> > >
> > > > Hi Guys,
> > > >
> > > > I have tested the parameters provided by Sandy, and
> > > > -XX:+UseParallelOldGC solved the GC problems; thanks for the help,
> > > > Sandy.
> > > > I'm still experiencing some difficulties: the RegionServer
> > > > continues to shut down, but now it seems related to I/O. It starts
> > > > to time out many connections, new connections to/from the machine
> > > > time out too, and finally the RegionServer dies because of a
> > > > YouAreDeadException. I will collect more data, but I think it's an
> > > > issue inherent to the Amazon/virtualized environment.
> > > >
> > > > Thanks for the great help provided so far.
> > > >
> > > > 2012/1/5 Leonardo Gamas
> > > >
> > > > > I don't think so; if Amazon stopped the machine it would cause a
> > > > > stop of minutes, not seconds, and the DataNode, TaskTracker and
> > > > > Zookeeper continue to work normally.
> > > > > But it could be related to the shared nature of the Amazon
> > > > > environment: maybe some spike in I/O caused by another
> > > > > virtualized server on the same physical machine.
> > > > >
> > > > > This is the instance type I'm using:
> > > > >
> > > > > *Extra Large Instance*
> > > > >
> > > > > 15 GB memory
> > > > > 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
> > > > > 1,690 GB instance storage
> > > > > 64-bit platform
> > > > > I/O Performance: High
> > > > > API name: m1.xlarge
> > > > >
> > > > > I was not expecting to suffer from these problems, or at least
> > > > > not this much.
> > > > >
> > > > > 2012/1/5 Sandy Pratt
> > > > >
> > > > >> You think it's an Amazon problem maybe?  Like they paused or
> > > > >> migrated your virtual machine, and it just happens to be during
> > > > >> GC, leaving us to think the GC ran long when it didn't?  I don't
> > > > >> have a lot of experience with Amazon so I don't know if that
> > > > >> sort of thing is common.
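> > > > >>
> > > > >> One crude way to catch a pause of the whole VM in the act (rough
> > > > >> sketch, untested): run a loop outside the JVM that sleeps 1s and
> > > > >> logs whenever the wall clock jumps by much more than that:
> > > > >>
> > > > >> $ while true; do t0=$(date +%s); sleep 1; t1=$(date +%s); \
> > > > >>     [ $((t1-t0)) -gt 5 ] && echo "paused ~$((t1-t0))s at $(date)"; done
> > > > >>
> > > > >> If that fires at the same moment as a "long GC", it's the
> > > > >> machine, not the collector.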
> > > > >>
> > > > >> > -----Original Message-----
> > > > >> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > Sent: Thursday, January 05, 2012 13:15
> > > > >> > To: user@hbase.apache.org
> > > > >> > Subject: Re: RegionServer dying every two or three days
> > > > >> >
> > > > >> > I checked the CPU Utilization graphics provided by Amazon
> > > > >> > (it's not accurate, since the sample time is about 5 minutes)
> > > > >> > and don't see any abnormality. I will set up TSDB with Nagios
> > > > >> > to have a more reliable source of performance data.
> > > > >> >
> > > > >> > The machines don't have swap space; if I run
> > > > >> >
> > > > >> > $ swapon -s
> > > > >> >
> > > > >> > to display a swap usage summary, it returns an empty list.
> > > > >> >
> > > > >> > I will drop -XX:+DoEscapeAnalysis -XX:+AggressiveOpts in my
> > > > >> > next tests.
> > > > >> >
> > > > >> > I hadn't paid much attention to the value of the new size
> > > > >> > param.
> > > > >> >
> > > > >> > Thanks again for the help!!
> > > > >> >
> > > > >> > 2012/1/5 Sandy Pratt
> > > > >> >
> > > > >> > > That size heap doesn't seem like it should cause a 36 second
> > > > >> > > GC (a minor GC even, if I remember your logs correctly), so
> > > > >> > > I tend to think that other things are probably going on.
> > > > >> > >
> > > > >> > > This line here:
> > > > >> > >
> > > > >> > > 14251.690: [GC 14288.620: [ParNew: 105352K->413K(118016K),
> > > > >> > > 0.0361840 secs] 954388K->849478K(1705776K), 0.0364200 secs]
> > > > >> > > [Times: user=0.05 sys=0.01, real=36.96 secs]
> > > > >> > >
> > > > >> > > is really mysterious to me.  It seems to indicate that the
> > > > >> > > process was blocked for almost 37 seconds during a minor
> > > > >> > > collection.  Note the CPU times are very low but the wall
> > > > >> > > time is very high.  If it was actually doing GC work, I'd
> > > > >> > > expect to see user time higher than real time, as it is in
> > > > >> > > other parallel collections (see your log snippet).  Were you
> > > > >> > > really so CPU starved that it took 37 seconds to get in 50ms
> > > > >> > > of work?  I can't make sense of that.  I'm trying to think
> > > > >> > > of something that would block you for that long while all
> > > > >> > > your threads are stopped for GC, other than being in swap,
> > > > >> > > but I can't come up with anything.  You're certain you're
> > > > >> > > not in swap?
> > > > >> > >
> > > > >> > > Maybe drop these guys -XX:+DoEscapeAnalysis
> > > > >> > > -XX:+AggressiveOpts while you troubleshoot?
> > > > >> > >
> > > > >> > > Why is your new size so small?  This generally means that
> > > > >> > > relatively more objects are being tenured than would be with
> > > > >> > > a larger new size.  This could make collections of the old
> > > > >> > > gen worse (GC time is said to be proportional to the number
> > > > >> > > of live objects in the generation, and CMS does indeed cause
> > > > >> > > STW pauses).  A typical new to tenured ratio might be 1:3.
> > > > >> > > Were the new gen GCs taking too long?  This is probably
> > > > >> > > orthogonal to your immediate issue, though.
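> > > > >> > >
> > > > >> > > (To see how common those outliers are, something like this
> > > > >> > > against your -Xloggc file should work; rough sketch,
> > > > >> > > untested:
> > > > >> > >
> > > > >> > > $ grep -o 'real=[0-9.]*' hbase-regionserver-gc.log | sort -t= -k2 -rn | head
> > > > >> > >
> > > > >> > > Anything with a big real but tiny user/sys is a suspect
> > > > >> > > pause outside the JVM.)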
> > > > >> > >
> > > > >> > > -----Original Message-----
> > > > >> > > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > > Sent: Thursday, January 05, 2012 5:33 AM
> > > > >> > > To: user@hbase.apache.org
> > > > >> > > Subject: Re: RegionServer dying every two or three days
> > > > >> > >
> > > > >> > > St.Ack,
> > > > >> > >
> > > > >> > > I haven't made any attempt at GC tuning yet.
> > > > >> > > I will read the perf section as suggested.
> > > > >> > > I'm currently using Nagios + JMX to monitor the cluster, but
> > > > >> > > it's used for alerting only; the perfdata is not being
> > > > >> > > stored, so it's kind of useless right now. I was thinking of
> > > > >> > > using TSDB to store it. Any known case of integration?
> > > > >> > > ---
> > > > >> > >
> > > > >> > > Sandy,
> > > > >> > >
> > > > >> > > Yes, my timeout is 30 seconds:
> > > > >> > >
> > > > >> > > <property>
> > > > >> > >   <name>zookeeper.session.timeout</name>
> > > > >> > >   <value>30000</value>
> > > > >> > > </property>
> > > > >> > >
> > > > >> > > For our application that is a bearable time to wait in case
> > > > >> > > a RegionServer goes offline.
> > > > >> > >
> > > > >> > > My heap is 4GB and my JVM params are:
> > > > >> > >
> > > > >> > > -Xmx4096m -server -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
> > > > >> > > -XX:CMSInitiatingOccupancyFraction=70 -XX:NewSize=128m
> > > > >> > > -XX:MaxNewSize=128m -XX:+DoEscapeAnalysis -XX:+AggressiveOpts
> > > > >> > > -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
> > > > >> > > -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log
> > > > >> > >
> > > > >> > > I will try the -XX:+UseParallelOldGC param and post my
> > > > >> > > feedback here.
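> > > > >> > >
> > > > >> > > Something like this in hbase-env.sh is what I plan to test
> > > > >> > > (a rough sketch, untested yet; as I understand it,
> > > > >> > > -XX:+UseParallelOldGC pairs with the throughput collector,
> > > > >> > > so I drop the ParNew/CMS flags here):
> > > > >> > >
> > > > >> > > export HBASE_OPTS="-Xmx4096m -server -XX:+UseParallelOldGC \
> > > > >> > >   -XX:NewSize=128m -XX:MaxNewSize=128m \
> > > > >> > >   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
> > > > >> > >   -Xloggc:/usr/lib/hbase/logs/hbase-regionserver-gc.log"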
> > > > >> > > ---
> > > > >> > >
> > > > >> > > Ramkrishna,
> > > > >> > >
> > > > >> > > Seems the GC is the root of all evil in this case.
> > > > >> > > ----
> > > > >> > >
> > > > >> > > Thank you all for the answers. I will try out the valuable
> > > > >> > > advice given here and post my results.
> > > > >> > >
> > > > >> > > Leo Gamas.
> > > > >> > >
> > > > >> > > 2012/1/5 Ramkrishna S Vasudevan <ramkrishna.vasudevan@huawei.com>
> > > > >> > >
> > > > >> > > > Recently we faced a similar problem and it was due to GC
> > > > >> > > > config.  Pls check your GC.
> > > > >> > > >
> > > > >> > > > Regards
> > > > >> > > > Ram
> > > > >> > > >
> > > > >> > > > -----Original Message-----
> > > > >> > > > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> > > > >> > > > Sent: Thursday, January 05, 2012 2:50 AM
> > > > >> > > > To: user@hbase.apache.org
> > > > >> > > > Subject: Re: RegionServer dying every two or three days
> > > > >> > > >
> > > > >> > > > On Wed, Jan 4, 2012 at 12:17 PM, Leonardo Gamas wrote:
> > > > >> > > > > The third line took 36.96 seconds to execute; can this
> > > > >> > > > > be causing this problem?
> > > > >> > > >
> > > > >> > > > Probably.  Have you made any attempt at GC tuning?
> > > > >> > > >
> > > > >> > > > > Reading the code a little, it seems that, even if it's
> > > > >> > > > > disabled, if all files are targeted in a compaction,
> > > > >> > > > > it's considered a major compaction. Is that right?
> > > > >> > > >
> > > > >> > > > That is right.  They get 'upgraded' from minor to major.
> > > > >> > > >
> > > > >> > > > This should be fine though.  What you are avoiding by
> > > > >> > > > setting major compactions to 0 is all regions being major
> > > > >> > > > compacted on a period, a heavyweight effective rewrite of
> > > > >> > > > all your data (unless already major compacted).  It looks
> > > > >> > > > like you have this disabled, which is good until you've
> > > > >> > > > wrestled your cluster into submission.
> > > > >> > > >
> > > > >> > > > > The machines don't have swap, so the swappiness
> > > > >> > > > > parameter doesn't seem to apply here. Any other
> > > > >> > > > > suggestion?
> > > > >> > > >
> > > > >> > > > See the perf section of the hbase manual.  It has our
> > > > >> > > > current list.
> > > > >> > > >
> > > > >> > > > Are you monitoring your cluster w/ ganglia or tsdb?
> > > > >> > > >
> > > > >> > > > St.Ack
> > > > >> > > >
> > > > >> > > > > Thanks.
> > > > >> > > > >
> > > > >> > > > > 2012/1/4 Leonardo Gamas
> > > > >> > > > >
> > > > >> > > > >> I will investigate this, thanks for the response.
> > > > >> > > > >>
> > > > >> > > > >> 2012/1/3 Sandy Pratt
> > > > >> > > > >>
> > > > >> > > > >>> 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client
> > > > >> > > > >>> session timed out, have not heard from server in
> > > > >> > > > >>> 61103ms for sessionid 0x23462a4cf93a8fc, closing
> > > > >> > > > >>> socket connection and attempting reconnect
> > > > >> > > > >>>
> > > > >> > > > >>> It looks like the process has been unresponsive for
> > > > >> > > > >>> some time, so ZK has terminated the session.  Did you
> > > > >> > > > >>> experience a long GC pause right before this?  If you
> > > > >> > > > >>> don't have GC logging enabled for the RS, you can
> > > > >> > > > >>> sometimes tell by noticing a gap in the timestamps of
> > > > >> > > > >>> the log statements leading up to the crash.
> > > > >> > > > >>>
> > > > >> > > > >>> If it turns out to be GC, you might want to look at
> > > > >> > > > >>> your kernel swappiness setting (set it to 0) and your
> > > > >> > > > >>> JVM params.
> > > > >> > > > >>>
> > > > >> > > > >>> Sandy
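> > > > >> > > > >>>
> > > > >> > > > >>> (To check swappiness and turn it off, a quick sketch
> > > > >> > > > >>> for a generic Linux box; persist the setting in
> > > > >> > > > >>> /etc/sysctl.conf if it helps:
> > > > >> > > > >>>
> > > > >> > > > >>> $ cat /proc/sys/vm/swappiness
> > > > >> > > > >>> $ sudo sysctl vm.swappiness=0
> > > > >> > > > >>> )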
> > > > >> > > > >>>
> > > > >> > > > >>> > -----Original Message-----
> > > > >> > > > >>> > From: Leonardo Gamas [mailto:leogamas@jusbrasil.com.br]
> > > > >> > > > >>> > Sent: Thursday, December 29, 2011 07:44
> > > > >> > > > >>> > To: user@hbase.apache.org
> > > > >> > > > >>> > Subject: RegionServer dying every two or three days
> > > > >> > > > >>> >
> > > > >> > > > >>> > Hi,
> > > > >> > > > >>> >
> > > > >> > > > >>> > I have a HBase/Hadoop CDH3 cluster with 4 machines
> > > > >> > > > >>> > (1 Master + 3 Slaves), running on Amazon EC2. The
> > > > >> > > > >>> > master is a High-Memory Extra Large Instance
> > > > >> > > > >>> > (m2.xlarge) with NameNode, JobTracker, HMaster and
> > > > >> > > > >>> > Zookeeper. The slaves are Extra Large Instances
> > > > >> > > > >>> > (m1.xlarge) running DataNode, TaskTracker,
> > > > >> > > > >>> > RegionServer and Zookeeper.
> > > > >> > > > >>> >
> > > > >> > > > >>> > From time to time, every two or three days, one of
> > > > >> > > > >>> > the RegionServer processes goes down, but the other
> > > > >> > > > >>> > processes (DataNode, TaskTracker, Zookeeper)
> > > > >> > > > >>> > continue normally.
> > > > >> > > > >>> >
> > > > >> > > > >>> > Reading the logs:
> > > > >> > > > >>> >
> > > > >> > > > >>> > The connection with Zookeeper timed out:
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61103ms for sessionid 0x23462a4cf93a8fc, closing socket connection and attempting reconnect
> > > > >> > > > >>> > 11/12/29 00:01:01 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 61205ms for sessionid 0x346c561a55953e, closing socket connection and attempting reconnect
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > And the Handlers start to fail:
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@66b1d0bf) from xx.xx.xx.xx:xxxx: output error
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 81 on 60020 caught: java.nio.channels.ClosedChannelException
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > >> > > > >>> >
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@29c9e430) from xx.xx.xx.xx:xxxx: output error
> > > > >> > > > >>> > 11/12/29 00:01:01 WARN ipc.HBaseServer: IPC Server handler 62 on 60020 caught: java.nio.channels.ClosedChannelException
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
> > > > >> > > > >>> >   at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1341)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > And finally the server throws a YouAreDeadException :( :
> > > > >> > > > >>> >
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x23462a4cf93a8fc has expired, closing socket connection
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Opening socket connection to server ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Socket connection established to ip-xx-xx-xx-xx.ec2.internal/xx.xx.xx.xx:2181, initiating session
> > > > >> > > > >>> > 11/12/29 00:01:02 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x346c561a55953e has expired, closing socket connection
> > > > >> > > > >>> > 11/12/29 00:01:03 FATAL regionserver.HRegionServer: ABORTING region server serverName=ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741, load=(requests=447, regions=206, usedHeap=1584, maxHeap=4083): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> > org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > >> > > > >>> >   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> > > > >> > > > >>> >   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> > > > >> > > > >>> >   at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> > > > >> > > > >>> >   at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> > > > >> > > > >>> >   at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:735)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:596)
> > > > >> > > > >>> >   at java.lang.Thread.run(Thread.java:662)
> > > > >> > > > >>> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:204)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:262)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:669)
> > > > >> > > > >>> >   at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> > > > >> > > > >>> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > > > >> > > > >>> >   at java.lang.reflect.Method.invoke(Method.java:597)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
> > > > >> > > > >>> >
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> > > > >> > > > >>> >   at $Proxy6.regionServerReport(Unknown Source)
> > > > >> > > > >>> >   at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> > > > >> > > > >>> >   ... 2 more
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: Dump of metrics: requests=66, regions=206, stores=2078, storefiles=970, storefileIndexSize=78, memstoreSize=796, compactionQueueSize=0, flushQueueSize=0, usedHeap=1672, maxHeap=4083, blockCacheSize=705907552, blockCacheFree=150412064, blockCacheCount=10648, blockCacheHitCount=79578618, blockCacheMissCount=3036335, blockCacheEvictedCount=1401352, blockCacheHitRatio=96, blockCacheHitCachingRatio=98
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO regionserver.HRegionServer: STOPPED: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ip-xx-xx-xx-xx.ec2.internal,60020,1324994490741 as dead server
> > > > >> > > > >>> > 11/12/29 00:01:03 INFO ipc.HBaseServer: Stopping server on 60020
> > > > >> > > > >>> > ---------------------------
> > > > >> > > > >>> >
> > > > >> > > > >>> > Then I restart the RegionServer and everything is
> > > > >> > > > >>> > back to normal.
> > > > >> > > > >>> > Reading the DataNode, Zookeeper and TaskTracker
> > > > >> > > > >>> > logs, I don't see any abnormality in the same time
> > > > >> > > > >>> > window.
> > > > >> > > > >>> > I think it was caused by the loss of the connection
> > > > >> > > > >>> > to Zookeeper. Is it advisable to run Zookeeper on
> > > > >> > > > >>> > the same machines? If the RegionServer loses its
> > > > >> > > > >>> > connection to Zookeeper, is there a way (a
> > > > >> > > > >>> > configuration perhaps) to rejoin the cluster, and
> > > > >> > > > >>> > not just die?
> > > > >> > > > >>> >
> > > > >> > > > >>> > Any idea what is causing this? Or how to prevent it
> > > > >> > > > >>> > from happening?
> > > > >> > > > >>> >
> > > > >> > > > >>> > Any help is appreciated.
> > > > >> > > > >>> >
> > > > >> > > > >>> > Best Regards,
> > > > >> > > > >>> >
> > > > >> > > > >>> > --
> > > > >> > > > >>> > *Leonardo Gamas*
> > > > >> > > > >>> > Software Engineer
> > > > >> > > > >>> > +557134943514
> > > > >> > > > >>> > +557581347440
> > > > >> > > > >>> > leogamas@jusbrasil.com.br
> > > > >> > > > >>> > www.jusbrasil.com.br
> > > > >> > > > >>
> > > > >> > > > >> --
> > > > >> > > > >> *Leonardo Gamas*
> > > > >> > > > >> Software Engineer/Chaos Monkey Engineer
> > > > >> > > > >> T (71) 3494-3514  C (75) 8134-7440
> > > > >> > > > >> leogamas@jusbrasil.com.br
> > > > >> > > > >> www.jusbrasil.com.br
> > > > >> > > > >
> > > > >> > > > > --
> > > > >> > > > > *Leonardo Gamas*
> > > > >> > > > > Software Engineer/Chaos Monkey Engineer
> > > > >> > > > > T (71) 3494-3514  C (75) 8134-7440
> > > > >> > > > > leogamas@jusbrasil.com.br
> > > > >> > > > > www.jusbrasil.com.br
> > > > >> > >
> > > > >> > > --
> > > > >> > > *Leonardo Gamas*
> > > > >> > > Software Engineer
> > > > >> > > +557134943514
> > > > >> > > +557581347440
> > > > >> > > leogamas@jusbrasil.com.br
> > > > >> > > www.jusbrasil.com.br
> > > > >> >
> > > > >> > --
> > > > >> > *Leonardo Gamas*
> > > > >> > Software Engineer
> > > > >> > T +55 (71) 3494-3514
> > > > >> > C +55 (75) 8134-7440
> > > > >> > leogamas@jusbrasil.com.br
> > > > >> > www.jusbrasil.com.br
> > > > >
> > > > > --
> > > > > *Leonardo Gamas*
> > > > > Software Engineer
> > > > > T +55 (71) 3494-3514
> > > > > C +55 (75) 8134-7440
> > > > > leogamas@jusbrasil.com.br
> > > > > www.jusbrasil.com.br
> > > >
> > > > --
> > > > *Leonardo Gamas*
> > > > Software Engineer
> > > > T +55 (71) 3494-3514
> > > > C +55 (75) 8134-7440
> > > > leogamas@jusbrasil.com.br
> > > > www.jusbrasil.com.br
> >
> > --
> > *Leonardo Gamas*
> > Software Engineer
> > T +55 (71) 3494-3514
> > C +55 (75) 8134-7440
> > leogamas@jusbrasil.com.br
> > www.jusbrasil.com.br

--
*Leonardo Gamas*
Software Engineer
T +55 (71) 3494-3514
C +55 (75) 8134-7440
leogamas@jusbrasil.com.br
www.jusbrasil.com.br