hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: YouAreDeadException after DFS Client errors.
Date Tue, 12 Apr 2011 18:45:13 GMT
YouAreDead means that the master is already processing the death of
those region servers when the region server talks back to the master.
Network split?

J-D
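
One quick way to check that on the master side is to grep the master log for the region server's name around the time of the crash. A rough sketch, assuming the master log follows the same hbase-crawler-* naming as the region server logs quoted below (adjust file names for your install):

# Did the master expire the session and start processing the server as dead
# before the server's own report came in?
bash-3.00$ grep -E "expired|dead server" hbase-crawler-master-*.log.2011-04-05 | grep b3130123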

On Tue, Apr 12, 2011 at 11:33 AM, Vidhyashankar Venkataraman
<vidhyash@yahoo-inc.com> wrote:
> This was something that happened a week back in our cluster: there was a flash death
> of region servers. A few of the region servers did have near-full heaps, so I thought GC could
> be at play, but many of them crashed after a few DFS errors followed by a YouAreDeadException
> and didn't have GC problems.
>
> This was in a 700-node cluster. Writes happen only through bulk loads. 50 regions per
> region server.
>
> After we restarted the cluster, it started running fine.
>
>
> ------ Forwarded Message
> From: Vidhyashankar Venkataraman <vidhyash@yahoo-inc.com>
> Date: Tue, 12 Apr 2011 09:44:24 -0700
> To: stack <saint.ack@gmail.com>
> Conversation: Hbase configs.
> Subject: Re: Hbase configs.
>
> As for the DFS errors: this was a sample log from one of the region servers that had
> this flash death last week. Notice that the final nail in the coffin was the YouAreDeadException,
> raised because ZK did not receive a timely update from the region server. Usually this happens
> when the heap is full, but that doesn't seem to be the case here (grep for FATAL in the following
> log). It was, however, preceded by quite a few DFSClient errors. I have usually noticed these
> DFSClient errors when there are too many files floating around; for now, we haven't
> been able to reproduce these errors as such. Please note that splits weren't disabled. Compactions
> were happening once every 5-7 hours per region. 50 regions per node. And the max file/region
> size was 8 gigs.
>
> 2011-04-04 01:27:28,273 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:28,273 WARN org.apache.hadoop.hdfs.DFSClient: NotReplicatedYetException
sleeping /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries left
4
> 2011-04-04 01:27:28,759 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:28,759 WARN org.apache.hadoop.hdfs.DFSClient: NotReplicatedYetException
sleeping /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries left
3
> 2011-04-04 01:27:29,562 INFO org.apache.hadoop.hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931
> 2011-04-04 01:27:29,562 WARN org.apache.hadoop.hdfs.DFSClient: NotReplicatedYetException
sleeping /hbase/WCC/98936987f714a2044103bd3b424e6148/.tmp/1182802151884288931 retries left
2
> java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be
ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=b3130001.yst.yahoo.net/67.195.49.108:60020]
> 2011-04-04 03:32:41,280 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification
of .META.,,1 at address=b3130001.yst.yahoo.net:60020; java.net.ConnectException: Connection
refused
> 2011-04-04 05:23:52,460 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block
blk_-3142768255648714362_53688655 from any node: java.io.IOException: No live nodes contain
current block. Will get new block locations from namenode and retry...
> 2011-04-04 09:57:14,606 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block
blk_-8807886313487927171_53799733 from any node: java.io.IOException: No live nodes contain
current block. Will get new block locations from namenode and retry...
> java.io.InterruptedIOException: Aborting compaction of store content in region WCC,r:jp#co#toptour#topsrv1!/searchtour/domestic/tur_lst.php?dst_are1=10&dst_dit1=50%2C173&dst_chg=1&dp_d=&dpt=2&tur_prd=&tur_sty=&sel_ple=2&mny_sm=&mny_bg=&air_typ=&grp=31&cal_flg=1&dst_flg=0&are1=10&cty1=50&dit1=173&grp=31&are2=&cty2=&dit2=&are3=&cty3=&dit3=&pps=&agt=&sort=1&htl=&opt_flg7=&opt_flg8=&opt_flg9=&kwd_dst=&kwd_htl=!http,1301454677216.9ca97f291d075c375143d3e65de1168c.
because user requested stop.
> bash-3.00$ grep FATAL hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-04
> bash-3.00$ grep FATAL hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> 2011-04-05 05:15:38,940 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, regions=77,
usedHeap=2521, maxHeap=7993): regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
received expired from ZooKeeper, aborting
> 2011-04-05 05:15:39,022 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING
region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0, regions=77,
usedHeap=2521, maxHeap=7993): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as
dead server
> bash-3.00$ grep -b35 -a30 FATAL hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05
> 607897318-    at javax.security.auth.Subject.doAs(Subject.java:396)
> 607897373-    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> 607897435-
> 607897436-    at org.apache.hadoop.ipc.Client.call(Client.java:742)
> 607897491-    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> 607897550-    at $Proxy4.commitBlockSynchronization(Unknown Source)
> 607897605-    at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1570)
> 607897687-    at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
> 607897772-    at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
> 607897857-    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 607897921-    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 607898003-    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 607898093-    at java.lang.reflect.Method.invoke(Method.java:597)
> 607898146-    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
> 607898202-    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
> 607898266-    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:958)
> 607898330-    at java.security.AccessController.doPrivileged(Native Method)
> 607898393-    at javax.security.auth.Subject.doAs(Subject.java:396)
> 607898448-    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:956)
> 607898510-
> 607898511-    at org.apache.hadoop.ipc.Client.call(Client.java:742)
> 607898566-    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
> 607898625-    at $Proxy9.recoverBlock(Unknown Source)
> 607898666-    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
> 607898761-    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
> 607898847-    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)
> 607898938-2011-04-05 05:15:38,841 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
for block blk_8685130736369835846_54964870 failed  because recovery from primary datanode
67.195.53.55:4610 failed 1 times.  Pipeline was 67.195.53.55:4610, 67.195.57.241:4610. Will
retry...
> 607899207-2011-04-05 05:15:38,842 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new
compressor
> 607899302-2011-04-05 05:15:38,852 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new
compressor
> 607899397-2011-04-05 05:15:38,852 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new
decompressor
> 607899494:2011-04-05 05:15:38,940 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
ABORTING region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
regions=77, usedHeap=2521, maxHeap=7993): regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 received expired from ZooKeeper, aborting
> 607899866-org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode
= Session expired
> 607899962-    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
> 607900060-    at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
> 607900150-    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
> 607900232-    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 607900305-2011-04-05 05:15:38,945 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
Dump of metrics: request=0.0, regions=77, stores=385, storefiles=1554, storefileIndexSize=2179,
memstoreSize=0, compactionQueueSize=55, usedHeap=2522, maxHeap=7993, blockCacheSize=13752888,
blockCacheFree=1662631752, blockCacheCount=0, blockCacheHitCount=0, blockCacheMissCount=43383235,
blockCacheEvictedCount=0, blockCacheHitRatio=0, blockCacheHitCachingRatio=0
> 607900750-2011-04-05 05:15:38,945 INFO org.apache.hadoop.hbase.regionserver.HRegionServer:
STOPPED: regionserver:60020-0x22f137aaab00096-0x22f137aaab00096 regionserver:60020-0x22f137aaab00096-0x22f137aaab00096
received expired from ZooKeeper, aborting
> 607900992-2011-04-05 05:15:38,945 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
> 607901076:2011-04-05 05:15:39,022 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
ABORTING region server serverName=b3130123.yst.yahoo.net,60020,1301703415323, load=(requests=0,
regions=77, usedHeap=2521, maxHeap=7993): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as
dead server
> 607901465-org.apache.hadoop.hbase.YouAreDeadException: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as
dead server
> 607901658-    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 607901732-    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> 607901829-    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> 607901934-    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> 607902002-    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)
> 607902090-    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:80)
> 607902179-    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:729)
> 607902280-    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:586)
> 607902363-    at java.lang.Thread.run(Thread.java:619)
> 607902405-Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.YouAreDeadException:
Server REPORT rejected; currently processing b3130123.yst.yahoo.net,60020,1301703415323 as
dead server
> 607902603-    at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:197)
> 607902688-    at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:247)
> 607902780-    at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:638)
> 607902860-    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> 607902924-    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 607903014-    at java.lang.reflect.Method.invoke(Method.java:597)
> 607903067-    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
> 607903139-    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1036)
> 607903218-
> 607903219-    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:753)
> 607903290-    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> 607903365-    at $Proxy3.regionServerReport(Unknown Source)
> 607903412-    at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:723)
> 607903513- ... 2 more
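
A quick way to line the DFSClient trouble up against the ZK session expiry in a log like the one above is to pull both sets of messages out of the same file and compare timestamps. A rough sketch; the message strings are the ones that appear in this particular log and may be worded differently in other versions:

# DFS write errors vs. the session expiry / YouAreDeadException, in time order.
bash-3.00$ grep -E "NotReplicatedYetException|Error Recovery for block|received expired from ZooKeeper|YouAreDeadException" \
    hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-05

If the DFS write errors cluster right before the expiry, that would fit a stall on the region server (GC, swap) or the network split J-D suggests, rather than a problem in HDFS itself.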
>
>
>
> On 4/11/11 10:12 PM, "stack" <saint.ack@gmail.com> wrote:
>
> It would be worth studying when compactions run... pick a few
> regions. You should do all you can to minimize how much compacting
> you do (I wish I could come hang w/ you lot for a week to play with
> this stuff).
> St.Ack
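
One way to get that picture from the region server logs is to grep the compaction messages for a handful of region names and look at the timestamps. A rough sketch only; the exact wording of the compaction log lines depends on the HBase version, so treat the pattern as a starting point (the region hash is the one from the interrupted compaction above):

# Timestamps of compaction-related lines for one region; cut keeps the date/time prefix.
bash-3.00$ grep -i "compaction" hbase-crawler-regionserver-b3130123.yst.yahoo.net.log.2011-04-04 \
    | grep 9ca97f291d075c375143d3e65de1168c | cut -c1-23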
>
> On Mon, Apr 11, 2011 at 9:35 PM, Vidhyashankar Venkataraman
> <vidhyash@yahoo-inc.com> wrote:
>> In fact, I had wanted to get a split for 30 days and then disable the splits
>> (so that I get a rough distribution of URLs over a 30-day period: that's the
>> max expiration time of docs).
>>
>> But there were too many things happening to pinpoint the exact bottleneck.
>> So that's our next task once we disable splits: to find a good
>> compaction frequency. Also, just so you realize, the max compact files is 5,
>> which means minor compactions happen roughly every 5 hours or more for
>> every region.
>>
>>
>> On 4/11/11 9:28 PM, "stack" <saint.ack@gmail.com> wrote:
>>
>>
>> Hmm... OK.
>>
>> Every hour.  Yes.  You want minor compactions to run then.  You want
>> to be careful, though, that we don't over-compact.  We should study your
>> running cluster and see if there is anything we can surmise from how
>> it runs.  See if we can optimize our compaction settings for your
>> particular hourly case.
>>
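
For reference, the knobs involved in that tuning live in hbase-site.xml. A rough sketch of how to check what is currently overridden, assuming the usual <name>/<value> layout and that $HBASE_HOME points at the install; the property names below are the 0.90-era ones, and anything unset simply falls back to the defaults:

# Which compaction/split-related settings are overridden on this cluster?
bash-3.00$ grep -A1 -E "hbase.hstore.compactionThreshold|hbase.hstore.compaction.max|hbase.hregion.majorcompaction|hbase.hregion.max.filesize" \
    $HBASE_HOME/conf/hbase-site.xml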
>
>
> ------ End of Forwarded Message
>
