hbase-dev mailing list archives

From "Jean-Daniel Cryans (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-1439) race between master and regionserver after missed heartbeat
Date Tue, 23 Jun 2009 18:38:07 GMT

     [ https://issues.apache.org/jira/browse/HBASE-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-1439:
--------------------------------------

    Priority: Blocker  (was: Major)

I did a little experiment to confirm some assumptions I had. I started one master, one RS and
one ZK server on my machine on a fresh HDFS. Right when the RS finished opening the ROOT and
META regions, I killed the ZK server to simulate a network partition. The master got:

{code}
org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Got ZooKeeper event, state: Disconnected, type: None, path: null
{code}

The RS got:

{code}
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
{code}

Then both processes just kept running, and that's a real problem. If ZK is the boss and you can't
talk to it, you should do an emergency shutdown. Bigtable has something like that:

{quote}
To ensure that a Bigtable cluster is not vulnerable to networking issues between the master and
Chubby, the master kills itself if its Chubby session expires.
{quote}
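
For illustration, a minimal sketch of what such an emergency shutdown could look like on the RS side,
using the plain ZooKeeper watcher API (the class and the shutdown helpers below are hypothetical, not
the actual HRegionServer code):

{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;

// Hypothetical watcher: once the ZK session is gone, stop serving instead of
// letting the daemon keep going while the master may be reassigning its regions.
public class SessionWatcher implements Watcher {
  public void process(WatchedEvent event) {
    KeeperState state = event.getState();
    if (state == KeeperState.Expired) {
      // ZK has already given us up for dead; mimic Bigtable and kill ourselves.
      emergencyShutdown("ZooKeeper session expired");
    } else if (state == KeeperState.Disconnected) {
      // Disconnected may be transient; abort only if we cannot reconnect
      // before the session timeout elapses.
      scheduleAbortIfNotReconnected();
    }
  }

  private void emergencyShutdown(String why) {
    // A real implementation would also avoid deleting its HLogs on the way out,
    // so the master's log split is not raced.
    System.err.println("Aborting region server: " + why);
    Runtime.getRuntime().halt(1);
  }

  private void scheduleAbortIfNotReconnected() {
    // Placeholder; the reconnect deadline is a design decision for the fix.
  }
}
{code}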

I also tried deleting the RS znode under hbase/rs/ to simulate a RS that is unable to contact ZK
but still able to talk to HDFS, and I got a boatload of:

{code}
2009-06-23 14:32:21,502 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60000, call regionServerReport(address: 192.168.1.86:60020, startcode: 1245781887285, load: (requests=0, regions=2, usedHeap=25, maxHeap=991), [Lorg.apache.hadoop.hbase.HMsg;@4baa2c23, [Lorg.apache.hadoop.hbase.HRegionInfo;@1137d4a4) from 192.168.1.86:37090: error: org.apache.hadoop.hbase.Leases$LeaseStillHeldException
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:251)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:662)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913)
2009-06-23 14:32:21,511 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000, call regionServerReport(address: 192.168.1.86:60020, startcode: 1245781887285, load: (requests=0, regions=2, usedHeap=26, maxHeap=991), [Lorg.apache.hadoop.hbase.HMsg;@6e4eeaaf, [Lorg.apache.hadoop.hbase.HRegionInfo;@7f11bfbc) from 192.168.1.86:37090: error: org.apache.hadoop.hbase.Leases$LeaseStillHeldException
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:251)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:662)
	at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913)
2009-06-23 14:32:21,515 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000, call regionServerReport(address: 192.168.1.86:60020, startcode: 1245781887285, load: (requests=0, regions=2, usedHeap=26, maxHeap=991), [Lorg.apache.hadoop.hbase.HMsg;@664310d0, [Lorg.apache.hadoop.hbase.HRegionInfo;@3d04fc23) from 192.168.1.86:37090: error: org.apache.hadoop.hbase.Leases$LeaseStillHeldException
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:251)
	at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:662)
	at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
	at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913)
{code}

The RS was getting the same kind of errors on its side:

{code}
2009-06-23 14:32:21,507 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 0)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:561)
	at java.lang.Thread.run(Thread.java:619)
2009-06-23 14:32:21,512 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Processing message (Retry: 1)
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
	at org.apache.hadoop.hbase.RemoteExceptionHandler.checkIOException(RemoteExceptionHandler.java:66)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:561)
	at java.lang.Thread.run(Thread.java:619)
{code}
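
To reproduce the znode part of this, the delete can be done with the plain ZooKeeper Java client;
a rough sketch follows (the connect string, the session timeout and the exact child name under
hbase/rs/ are placeholders, not taken from the setup above):

{code}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class DeleteRsZnode {
  public static void main(String[] args) throws Exception {
    // Placeholder connect string; point it at the quorum HBase is using.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        // No-op: this one-shot tool does not care about session events.
      }
    });
    // The RS registers an ephemeral znode under hbase/rs/. Deleting it makes the
    // RS look gone from ZK's point of view while the process itself keeps running
    // and keeps calling regionServerReport, which is what produced the
    // LeaseStillHeldException storm above.
    zk.delete("/hbase/rs/<ephemeral-node>", -1);  // -1 matches any znode version
    zk.close();
  }
}
{code}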

I'm marking this issue as a blocker: losing ZK must not leave both the master and the RS running as if nothing happened.

> race between master and regionserver after missed heartbeat
> -----------------------------------------------------------
>
>                 Key: HBASE-1439
>                 URL: https://issues.apache.org/jira/browse/HBASE-1439
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.1
>         Environment: CentOS 5.2 x86_64, HBase 0.19.1, Hadoop 0.19.1
>            Reporter: Andrew Purtell
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.20.0
>
>
> Seen on one of our 0.19.1 clusters:
> {code}
> java.io.FileNotFoundException: File does not exist: hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000/data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898
>  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:415)
>  at org.apache.hadoop.fs.FileSystem.getLength(FileSystem.java:679)
>  at org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1431)
>  at org.apache.hadoop.hbase.io.SequenceFile$Reader.<init>(SequenceFile.java:1426)
>  at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:753)
>  at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:716)
>  at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:249)
>  at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:442)
>  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:377)
> 2009-05-17 04:05:55,481 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server 10.3.134.207:60020: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> {code}
> I do not have the region server log yet, but here is my conjecture:
> Here, the write ahead log for 10.3.134.207 is missing in DFS: java.io.FileNotFoundException:
> hdfs://jdc2-atr-dc-2.atr.trendmicro.com:50000/data/hbase/log_10.3.134.207_1242286427894_60020/hlog.dat.1242528291898
> when the master tries to split it after declaring the region server crashed. There have been
> recent trouble reports on this cluster that indicate severe memory stress, e.g. kernel panics
> due to OOM. Based on that I think it is likely that the region server here missed a heartbeat,
> so the master declared it crashed and began to split the log. But the log was then deleted
> out from underneath the master's split thread. I think the region server was actually still
> running but partially swapped out, or the node was otherwise overloaded, so it missed its heartbeat.
> Then, when the region server came back after being swapped in, it realized it had missed its
> heartbeat and shut down, deleting the log as is normally done.
> Even if the above is not the actual cause in this case, I think the scenario is plausible.
> What do you think?

