hbase-dev mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: YouAreDeadException
Date Sat, 15 Jan 2011 02:20:54 GMT
I'm not seeing much in the way of errors, just timeouts, all to one machine
ending with .80, so that is probably your failed node.

Other than that, the log doesn't seem to say too much.  Searching for
strings like FATAL and Exception is the way to go here.
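As a minimal sketch of what I mean (the log path is just an example --
point it at wherever your regionserver/master logs actually live):

```shell
#!/bin/sh
# Pull the interesting lines out of an HBase log: FATAL markers and
# exception lines (the heads of stack traces). Reads from stdin.
scan_hbase_log() {
  grep -E 'FATAL|Exception'
}

# Typical use (example path, adjust for your install):
# scan_hbase_log < /var/log/hbase/hbase-regionserver.log
```

Adding `-B 2 -A 5` to the grep is often worth it so you see the context
around each abort.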

Also things like this:
2011-01-14 23:38:52,936 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Region has been
PENDING_OPEN for too long, reassigning region=
NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1294897314309,@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xE2%\x83\xD4\xAC@v[h\xDC,129489731602
7.2c40637c6c648a67162cc38d8c6d8ee9.


Guessing, I'd probably say your nodes hit some performance wall (io-wait,
networking, or something similar), and the regionserver processes stopped
responding but had not yet timed out from ZooKeeper.  You would then run
into a situation where some nodes are unresponsive, so any data hosted
there would be difficult to talk to.  Until the regionserver's ZooKeeper
node times out, the master doesn't know about the regionserver's fault.
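The knob that controls that detection window is the ZooKeeper session
timeout in hbase-site.xml.  A fragment for reference -- the 90000 ms value
below just matches the sessionTimeout visible in your logs, it is not a
recommendation:

```xml
<!-- hbase-site.xml (fragment): how long a regionserver can go silent
     before ZooKeeper expires its session and the master declares it
     dead.  Raising it rides out longer GC/io-wait pauses, but also
     delays detection of genuinely dead nodes. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>
```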

The master web UI is probably inaccessible because the META table is
on a regionserver that went AWOL.  You should check your load and your
ganglia graphs.  Also remember: despite having lots of disks, each
node has a single gigabit ethernet link, which means about 110-120
MB/sec.  It's quite possible you are running into network limitations.
Regionservers must write to 2 additional datanodes, and there will be
overlap, so you have to share some of that 110-120 MB/sec per node
with other nodes -- not to mention that you also need to factor in
inbound bandwidth (from client -> hbase regionserver) and outbound
bandwidth (from datanode replica 1 -> dn replica 2).
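To make the arithmetic concrete, here's a back-of-the-envelope sketch --
the numbers are illustrative assumptions, not measurements from your
cluster:

```shell
#!/bin/sh
# Rough ceiling on sustainable client ingest per node under HDFS
# replication. NIC_MBS is assumed usable gigabit throughput; 3x
# replication is the HDFS default (1 local copy + 2 remote).
NIC_MBS=114          # ~usable gigabit ethernet, MB/sec (assumed)
REPLICATION=3

# Each MB a regionserver ingests costs roughly (REPLICATION - 1) MB of
# NIC traffic beyond the local write: the replication pipeline sends
# copies to the other datanodes, and the node also absorbs inbound
# replica traffic from its peers under uniform load.
MAX_INGEST=$((NIC_MBS / (REPLICATION - 1)))
echo "rough max ingest per node: ${MAX_INGEST} MB/sec"
```

So the wire can saturate well before the disks do, which is consistent
with hitting a wall after doubling the data volume.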

-ryan

On Fri, Jan 14, 2011 at 3:57 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> Now I cannot access the master web UI. This happened after I doubled the
> amount of data processed in our flow.
>
> I am attaching master log.
>
> On Fri, Jan 14, 2011 at 3:10 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>>
>> This is the cause:
>>
>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
>> received expired from ZooKeeper, aborting
>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>
>> Why did the session expire?  Typically it's GC -- what do your GC logs
>> say?  Otherwise, network issues perhaps?  Swapping?  Other
>> machine-related system problems?
>>
>> -ryan
>>
>>
>> On Fri, Jan 14, 2011 at 3:04 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > I ran 0.90 RC3 in dev cluster.
>> >
>> > I saw the following in region server log:
>> >
>> > Caused by: org.apache.hadoop.ipc.RemoteException:
>> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
>> > currently processing sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 as dead server
>> >    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
>> >    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
>> >    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:648)
>> >    at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>> >    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> >    at java.lang.reflect.Method.invoke(Method.java:597)
>> >    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>> >    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
>> >
>> >    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
>> >    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>> >    at $Proxy0.regionServerReport(Unknown Source)
>> >    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:702)
>> >    ... 2 more
>> > 2011-01-13 03:55:08,982 INFO org.apache.zookeeper.ZooKeeper: Initiating
>> > client connection,
>> > connectString=sjc1-hadoop0.sjc1.carrieriq.com:2181 sessionTimeout=90000
>> > watcher=hconnection
>> > 2011-01-13 03:55:08,914 FATAL
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
>> > serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378,
>> > load=(requests=0, regions=6, usedHeap=514, maxHeap=3983):
>> > regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004
>> > received expired from ZooKeeper, aborting
>> > org.apache.zookeeper.KeeperException$SessionExpiredException:
>> > KeeperErrorCode = Session expired
>> >    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>> >    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>> >    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>> >    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
>> >
>> > ---------------
>> >
>> > And the following from master log:
>> >
>> > 2011-01-13 03:52:42,003 INFO
>> > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
>> > ephemeral node deleted, processing expiration [
>> > sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378]
>> > 2011-01-13 03:52:42,005 DEBUG
>> > org.apache.hadoop.hbase.master.ServerManager:
>> > Added=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378 to dead
>> > servers,
>> > submitted shutdown handler to be executed, root=false, meta=false
>> > 2011-01-13 03:52:42,005 INFO
>> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
>> > logs
>> > for sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> > 2011-01-13 03:52:42,092 INFO
>> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 1 hlog(s)
>> > in hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378
>> > 2011-01-13 03:52:42,093 DEBUG
>> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> > Thread[WriterThread-0,5,main]: starting
>> > 2011-01-13 03:52:42,094 DEBUG
>> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Writer thread
>> > Thread[WriterThread-1,5,main]: starting
>> > 2011-01-13 03:52:42,096 DEBUG
>> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of
>> > 1: hdfs://sjc1-hadoop0.sjc1.carrieriq.com:9000/hbase/.logs/sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378/sjc1-hadoop1.sjc1.carrieriq.com%3A60020.1294860449407,
>> > length=0
>> >
>> > Please advise what could be the cause.
>> >
>> > Thanks
>> >
>
>
