hbase-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: YouAreDeadException
Date Sun, 16 Jan 2011 01:02:33 GMT
For #1, what string should I look for in the region server log?
For #4, what's the rationale behind sending YADE after receiving a heartbeat?
I thought heartbeat means the RS is alive.

Thanks

On Sat, Jan 15, 2011 at 4:49 PM, Stack <stack@duboce.net> wrote:

> FYI Ted, the YouAreDeadException usually happens in the following context:
>
> 1. Regionserver has some kinda issue -- long GC pause for instance --
> and it stops tickling zk.
> 2. Master gets the zk session expired event.  Starts up recovery of the
> hung regionserver.
> 3. Regionserver recovers but has not yet processed its session expired
> event.  It heartbeats the Master as though nothing is wrong.
> 4. Master is mid-recovery, or already past it, and on receipt of the
> heartbeat in essence tells the regionserver to 'go away' by sending it
> the YouAreDeadException (sketched below).
> 5. By now the regionserver will have gotten its session expired
> notification and will have started an abort, so the YADE is not news
> when it receives the exception.
>
> St.Ack
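
A minimal Java sketch of the master-side check in step 4 above, loosely modeled on the ServerManager.checkIsDead() frame that appears in the stack trace further down this thread. Only the YouAreDeadException name and its "Server REPORT rejected" message come from the logs; the DeadServerCheck class, the deadServers set, and the rest of the wiring are illustrative assumptions, not the actual 0.90 source.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: once the master has started shutdown handling for a
// regionserver, that server's next report is rejected with YouAreDeadException.
class DeadServerCheck {

    // Servers the master is currently processing (or has finished processing) as dead.
    private final Set<String> deadServers =
        Collections.synchronizedSet(new HashSet<String>());

    // Step 2: the server's zk session expired, recovery begins.
    void expireServer(String serverName) {
        deadServers.add(serverName);
    }

    // Called when a regionserver report (heartbeat) arrives -- step 4.
    void checkIsDead(String serverName) throws YouAreDeadException {
        if (deadServers.contains(serverName)) {
            throw new YouAreDeadException("Server REPORT rejected; currently processing "
                + serverName + " as dead server");
        }
    }

    // Stand-in for org.apache.hadoop.hbase.YouAreDeadException.
    static class YouAreDeadException extends Exception {
        YouAreDeadException(String msg) { super(msg); }
    }
}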
>
>
> On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > Thanks for your analysis, Ryan.
> > The dev cluster has half as many nodes as our staging cluster. Each node
> > has half as many cores as a node in staging.
> >
> > I agree with your conclusion.
> >
> > I will report back after I collect more data - the flow uses hbase heavily
> > toward the end.
> >
> > On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> >
> >> I'm not seeing much in the way of errors; the timeouts are all to one
> >> machine ending in .80, so that is probably your failed node.
> >>
> >> Other than that, the log doesn't seem to say too much.  Searching for
> >> strings like FATAL and Exception is the way to go here.
> >>
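For what it's worth, a trivial, self-contained Java sketch of that kind of search; the log path below is a hypothetical placeholder, so point it at your actual regionserver log:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Print every log line containing the strings suggested above (FATAL, Exception).
public class LogScan {
    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real log file as the first argument.
        String path = args.length > 0 ? args[0] : "/var/log/hbase/regionserver.log";
        BufferedReader in = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("FATAL") || line.contains("Exception")) {
                    System.out.println(line);
                }
            }
        } finally {
            in.close();
        }
    }
}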
> >> Also things like this:
> >> 2011-01-14 23:38:52,936 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,1294897316027.2c40637c6c648a67162cc38d8c6d8ee9.
> >>
> >>
> >> Guessing, I'd say your nodes hit some performance wall -- io-wait,
> >> networking, or something similar -- and regionserver processes stopped
> >> responding but had not yet timed out from zookeeper.  In that situation
> >> some nodes are unresponsive, so any data hosted there is difficult to
> >> talk to.  Until the regionserver times out its zookeeper node, the
> >> master doesn't know about the fault of that regionserver.
> >>
> >> The master web UI is probably inaccessible because the META table is
> >> on a regionserver that went AWOL.  You should check your load and your
> >> ganglia graphs.  Also remember that, despite having lots of disks, each
> >> node has a single gigabit ethernet link, which means about 110-120 MB/sec.
> >> It's quite possible you are running into network limits: regionservers
> >> must write to 2 additional datanodes, and there will be overlap, so you
> >> have to share some of that 110-120 MB/sec per-node figure with other
> >> nodes, not to mention that you also need to factor in inbound bandwidth
> >> (from client -> hbase regionserver) and outbound bandwidth (from
> >> datanode replica 1 -> dn replica 2).  A rough estimate follows this
> >> message.
> >>
> >> -ryan
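
To put rough numbers on that, a back-of-envelope sketch in Java. Everything here is an assumption for illustration (full-duplex gigabit links, regionserver co-located with a datanode, uniform load across the cluster, HDFS replication factor 3, and the class/variable names); it is not a measurement of this cluster.

// Estimate how much client write traffic one node can accept before its NIC
// saturates, given that each accepted write is also replicated to two more
// datanodes, so replica traffic originating on other nodes lands here too.
public class WalBandwidthEstimate {
    public static void main(String[] args) {
        double linkMBps = 115.0;  // usable gigabit ethernet, per the 110-120 MB/sec figure
        int replication = 3;      // 1 local copy + 2 additional datanodes

        // For every X MB/s of client writes a node accepts, on average its NIC sees:
        //   inbound : X from clients + (replication - 1) * X as a pipeline target
        //             for replicas that originate on other nodes
        //   outbound: (replication - 1) * X forwarding replicas down the pipeline
        double inboundFactor = replication;        // 1 + (replication - 1) = 3
        double outboundFactor = replication - 1;   // 2

        double maxClientWriteMBps = linkMBps / Math.max(inboundFactor, outboundFactor);
        System.out.printf("Roughly %.0f MB/sec of client writes per node before the link saturates%n",
                maxClientWriteMBps);
    }
}

With those assumptions the sustainable client write rate works out to roughly a third of the raw link, on the order of 38 MB/sec per node rather than 110-120 MB/sec, which is the kind of wall being described above.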
> >>
> >> On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >> > Now I cannot access the master web UI. This happened after I doubled
> >> > the amount of data processed in our flow.
> >> >
> >> > I am attaching master log.
> >> >
> >> > On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> >> >>
> >> >> This is the cause:
> >> >>
> >> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
> >> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
> >> >>
> >> >> Why did the session expire?  Typically it's GC; what do your GC logs
> >> >> say?  Otherwise, network issues perhaps?  Swapping?  Other
> >> >> machine-related system problems?
> >> >>
> >> >> -ryan
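
For reference, a hedged sketch of the regionserver-side reaction visible in that log: when ZooKeeper delivers an Expired event, the watcher aborts the server. It is loosely modeled on the ZooKeeperWatcher.connectionEvent() frame in the stack trace below; Watcher, WatchedEvent, KeeperException and Abortable are the real ZooKeeper/HBase interfaces, while the SessionExpiryWatcher class and its wiring are illustrative assumptions.

import org.apache.hadoop.hbase.Abortable;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Illustrative watcher: on session expiry, abort the hosting server
// (in HBase the HRegionServer itself implements Abortable).
class SessionExpiryWatcher implements Watcher {

    private final Abortable server;

    SessionExpiryWatcher(Abortable server) {
        this.server = server;
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.None
                && event.getState() == Event.KeeperState.Expired) {
            // Produces an abort like the "received expired from ZooKeeper,
            // aborting" FATAL line in the regionserver log.
            server.abort("received expired from ZooKeeper, aborting",
                    new KeeperException.SessionExpiredException());
        }
    }
}

The master sees the same expiry from the other side when the server's ephemeral znode goes away, which is the "RegionServer ephemeral node deleted, processing expiration" line in the master log further down.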
> >> >>
> >> >>
> >> >> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >> >> > I ran 0.90 RC3 in dev cluster.
> >> >> >
> >> >> > I saw the following in region server log:
> >> >> >
> >> >> > Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as dead server
> >> >> >    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
> >> >> >    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
> >> >> >    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
> >> >> >    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> >> >> >    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >> >    at java.lang.reflect.Method.invoke(Method.java:597)
> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
> >> >> >
> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
> >> >> >    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >> >> >    at $Proxy0.regionServerReport(Unknown Source)
> >> >> >    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
> >> >> >    ... 2 more
> >> >> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181 sessionTimeout=90000 watcher=hconnection
> >> >> > 2011-01-13 03:55:08,914 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received expired from ZooKeeper, aborting
> >> >> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >> >> >    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
> >> >> >    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
> >> >> >    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> >> >> >    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> >> >> >
> >> >> > ---------------
> >> >> >
> >> >> > And the following from master log:
> >> >> >
> >> >> > 2011-01-13 03:52:42,003 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
> >> >> > 2011-01-13 03:52:42,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead servers, submitted shutdown handler to be executed, root=false, meta=false
> >> >> > 2011-01-13 03:52:42,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
> >> >> > 2011-01-13 03:52:42,092 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s) in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
> >> >> > 2011-01-13 03:52:42,093 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-0,5,main]: starting
> >> >> > 2011-01-13 03:52:42,094 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread Thread[WriterThread-1,5,main]: starting
> >> >> > 2011-01-13 03:52:42,096 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 1: hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407, length=0
> >> >> >
> >> >> > Please advise what could be the cause.
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >
> >> >
> >>
> >
>
