hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stack <st...@duboce.net>
Subject Re: YouAreDeadException
Date Sun, 16 Jan 2011 00:49:50 GMT
FYI Ted, the YourAreDeadException usually happens in following context:

1. Regionserver has some kinda issue -- long GC pause for instance --
and it stops tickling zk.
2. Master gets zk session expired event.  Starts up recovery of the hung region.
3. Regionserver recovers but has not yet processed its session expired
event.  It heartbeats the Master as though nothing wrong.
4. Master is mid-recovery or beyond server recovery and on receipt of
the heartbeat in essence tells the regionserver to 'go away' by
sending him the YouAreDeadException.
5. By now the regionserver will have gotten its session expired
notification and will have started an abort so the YADE is not news
when it receives the exception.

St.Ack


On Fri, Jan 14, 2011 at 7:49 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> Thanks for your analysis, Ryan.
> The dev cluster has half as many nodes as our staging cluster. Each node has
> half the number of cores as the node in staging.
>
> I agree with your conclusion.
>
> I will report back after I collect more data - the flow uses hbase heavily
> toward the end.
>
> On Fri, Jan 14, 2011 at 6:20 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>
>> I'm seeing not much in the way of errors, timeouts, all to one machine
>> ending with .80, so that is probably your failed node.
>>
>> Other than that, the log doesnt seem to say too much.  Searching for
>> strings like FATAL and Exception is the way to go here.
>>
>> Also things like this:
>> 2011-01-14 23:38:52,936 INFO
>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
>> PENDING_OPEN for too long, reassigning region=
>>
>> NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v
>> [h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,129489731602
>> 7.2c40637c6c648a67162cc38d8c6d8ee9.
>>
>>
>> Guessing, I'd probably say your nodes hit some performance wall, with
>> io-wait, or networking, or something, and Regionserver processes
>> stopped responding, but did not time out from zookeeper yet... so you
>> would run into a situation where some nodes are unresponsive, so any
>> data hosted there would be difficult to talk to.  Until the
>> regionserver times out it's zookeeper node, the master doesnt know
>> about the fault of the regionserver.
>>
>> The master web UI is probably inaccessible because the META table is
>> on a regionserver that went AWOL.  You should check your load, your
>> ganglia graphs.  Also remember, despite having lots of disks, each
>> node is a gigabit ethernet which means about 110-120 MB/sec.  It's
>> quite possible you are running into network limitations, remember that
>> regionservers must write to 2 additional datanodes, and there will be
>> overlap, thus you have to share some of that 110-120MB/sec per node
>> figure with other nodes, not to mention that you also need to factor
>> inbound bandwidth (from client->hbase regionserver) and outbound
>> bandwidth (from datanode replica 1 -> dn replica 2).
>>
>> -ryan
>>
>> On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > Now I cannot access master web UI, This happened after I doubled the
>> amount
>> > of data processed in our flow.
>> >
>> > I am attaching master log.
>> >
>> > On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>> >>
>> >> This is the cause:
>> >>
>> >> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> server
>> >> serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> >> load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> >> regionserver:60020-0x12d7b7b1c760004
>> regionserver:60020-0x12d7b7b1c760004
>> >> received expired from ZooKeeper, aborting
>> >> org.apache.zookeeper.KeeperException$SessionExpiredException:
>> >>
>> >> Why did the session expire?  Typically it's GC, what does your GC logs
>> >> say?  Otherwise, network issues perhaps?  Swapping?  Other machine
>> >> related systems problems?
>> >>
>> >> -ryan
>> >>
>> >>
>> >> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> >> > I ran 0.90 RC3 in dev cluster.
>> >> >
>> >> > I saw the following in region server log:
>> >> >
>> >> > Caused by: org.apache.hadoop.ipc.RemoteException:
>> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> >> > currently processing sjc1-hadoop1.sjc1.carrieriq.com
>> ,60020,1294856823378
>> >> > as
>> >> > dead server
>> >> >    at
>> >> >
>> >> >
>> org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
>> >> >    at
>> >> >
>> >> >
>> org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
>> >> >    at
>> >> >
>> >> >
>> org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
>> >> >    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> >> >    at
>> >> >
>> >> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >> >    at java.lang.reflect.Method.invoke(Method.java:597)
>> >> >    at
>> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >> >    at
>> >> >
>> >> >
>> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
>> >> >
>> >> >    at
>> org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
>> >> >    at
>> >> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>> >> >    at $Proxy0.regionServerReport(Unknown Source)
>> >> >    at
>> >> >
>> >> >
>> org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
>> >> >    ... 2 more
>> >> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper:
>> Initiating
>> >> > client connection,
>> >> > connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181
>> sessionTimeout=90000
>> >> > watcher=hconnection
>> >> > 2011-01-13 03:55:08,914 FATAL
>> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> >> > server
>> >> > serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> >> > load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> >> > regionserver:60020-0x12d7b7b1c760004
>> >> > regionserver:60020-0x12d7b7b1c760004
>> >> > received expired from ZooKeeper, aborting
>> >> > org.apache.zookeeper.KeeperException$SessionExpiredException:
>> >> > KeeperErrorCode = Session expired
>> >> >    at
>> >> >
>> >> >
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>> >> >    at
>> >> >
>> >> >
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>> >> >    at
>> >> >
>> >> >
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>> >> >    at
>> >> > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
>> >> >
>> >> > ---------------
>> >> >
>> >> > And the following from master log:
>> >> >
>> >> > 2011-01-13 03:52:42,003 INFO
>> >> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
>> >> > ephemeral node deleted, processing expiration [
>> >> > sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
>> >> > 2011-01-13 03:52:42,005 DEBUG
>> >> > org.apache.hadoop.hbase.master.ServerManager:
>> >> > Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead
>> >> > servers,
>> >> > submitted shutdown handler to be executed, root=false, meta=false
>> >> > 2011-01-13 03:52:42,005 INFO
>> >> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>> Splitting
>> >> > logs
>> >> > for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> >> > 2011-01-13 03:52:42,092 INFO
>> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1
>> >> > hlog(s)
>> >> > in hdfs://
>> >> >
>> >> >
>> sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> >> > 2011-01-13 03:52:42,093 DEBUG
>> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> >> > Thread[WriterThread-0,5,main]: starting
>> >> > 2011-01-13 03:52:42,094 DEBUG
>> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> >> > Thread[WriterThread-1,5,main]: starting
>> >> > 2011-01-13 03:52:42,096 DEBUG
>> >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog
>> 1
>> >> > of
>> >> > 1: hdfs://
>> >> >
>> >> >
>> sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407
>> ,
>> >> > length=0
>> >> >
>> >> > Please advise what could be the cause.
>> >> >
>> >> > Thanks
>> >> >
>> >
>> >
>>
>

Mime
View raw message