hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2428) NPE in ProcessRegionClose because meta is offline kills master and thus the cluster
Date Wed, 28 Apr 2010 06:42:32 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861695#action_12861695
] 

stack commented on HBASE-2428:
------------------------------

The unit test over in HBASE-2414 has turned up root cause.  If a RegionServerOperation is
unable to complete -- ProcessRegionClose, ProcessServerShutdown are all RegionServerOperations,
the events processed by the HMaster -- say because meta is offlined, its put on a DelayQueue.
 On construction of the RegionServerOperation the 'delay' is set.  The DelayQueue is processed
before new events when the master is going about its business.  Whats happening above is that
the ProcessRegionClose failed to complete because meta was offline.  It was put on the DelayQueue.
 The first put on the DelayQueue was probably fine... we waited some time and then retried
but the meta still hadn't come online.  We were put on the DelayQueue again only now our delay
had turned negative.  Since delayqueue is processed first, all we did was try to process the
close even though in the other more live todo queue, we might have the process of the regionserver
shutdown waiting, the process of the regionserver that was carrying .meta.  We'd never recover.

Will make a patch in morning.

> NPE in ProcessRegionClose because meta is offline kills master and thus the cluster
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-2428
>                 URL: https://issues.apache.org/jira/browse/HBASE-2428
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.20.5
>
>
> This issue was born of study done in hbase-2413.  The meta went offline and we were processing
a region close at the same time.  The close processing fell into a NPE loop and wouldn't get
out of it killing master and effectivly killing the cluster:
> {code}
> 2010-03-31 17:50:57,004 INFO org.apache.hadoop.hbase.master.ServerManager: hbasetest020.X.X.X,60020,1270077892989
znode expired
> 2010-03-31 17:50:57,004 INFO org.apache.hadoop.hbase.master.RegionManager: META region
removed from onlineMetaRegions
> 2010-03-31 17:51:15,385 INFO org.apache.hadoop.hbase.master.ServerManager: Received start
message from: hbasetest020.X.X.X,60020,1270083075377
> 2010-03-31 17:51:15,399 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated
ZNode /hbase/rs/1270083075377 with data 10.18.35.215:60020
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Server is
overloaded: load=5, avg=3.0, slop=0.3
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Choosing
to reassign 2 regions. mostLoadedRegions has 5 regions in it.
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to
close region test1,3147000000,1270081876965
> 2010-03-31 17:51:15,870 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to
close region test1,9352000000,1270080893514
> 2010-03-31 17:51:15,870 INFO org.apache.hadoop.hbase.master.RegionManager: Skipped 0
region(s) that are in transition states
> 2010-03-31 17:51:15,878 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_CLOSE: test1,3147000000,1270081876965 from hbasetest019.X.X.X,60020,1270082983630;
1 of 2
> 2010-03-31 17:51:15,879 INFO org.apache.hadoop.hbase.master.ServerManager: Processing
MSG_REPORT_CLOSE: test1,9352000000,1270080893514 from hbasetest019.X.X.X,60020,1270082983630;
2 of 2
> 2010-03-31 17:51:44,897 DEBUG org.apache.hadoop.hbase.master.ProcessRegionClose$1: Trying
to contact region server for regionName '.META.,,1', but failed after 10 attempts.
> Exception 1:
> java.io.IOException: Call to /10.18.35.215:60020 failed on local exception: java.io.EOFExceptionException
1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> java.net.ConnectException: Connection refusedException 1:
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
.META.,,1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2282)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1989)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:577)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> Exception 1:
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
.META.,,1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2282)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1989)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:577)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
> org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
.META.,,1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2282)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.delete(HRegionServer.java:1989)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:577)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>         at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:74)
>         at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
> 2010-03-31 17:51:44,899 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo:
ProcessRegionClose of test1,1230000000,1270081673808, false, reassign: true
> 2010-03-31 17:51:44,899 DEBUG org.apache.hadoop.hbase.master.ProcessRegionClose$1: Exception
in RetryableMetaOperation:
> java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
>         at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
> 2010-03-31 17:51:44,900 WARN org.apache.hadoop.hbase.master.HMaster: Processing pending
operations: ProcessRegionClose of test1,1230000000,1270081673808, false, reassign: true
> java.lang.RuntimeException: java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:96)
>         at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>         at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
> Caused by: java.lang.NullPointerException
>         at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:64)
>         ... 3 more
> {code}
> ... and so on.
> Marking a blocker for 0.20.5.
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message