hbase-issues mailing list archives

From "jiraposter@reviews.apache.org (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
Date Sat, 24 Sep 2011 03:23:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113887#comment-13113887 ]

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------



bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > Great stuff!  I have some questions throughout but seems like this will make everything
more resilient to root/meta servers failing.  Is the general approach to always verify / always
check rather than relying on cached locations or values?
bq.  > 
bq.  > Have you thought about any ways that we could add some better unit tests around
this stuff?  There's a TestRollingRestart that is obviously not good enough :)

Reproducing this kind of bug depends on the timing of events. Initially I thought we could inject
timeouts at various places in the code. At this point, it is easier to test on a small cluster,
where the bug eventually appears. Better unit tests are perhaps something we can work on later.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java,
line 309
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line309>
bq.  >
bq.  >     why log the cached META server here?  didn't we just verify that it was not
valid?

Fixed.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java,
line 310
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line310>
bq.  >
bq.  >     why log the cached meta location here?  it might be confusing since it doesn't
log that we just found this meta location was invalid

Fixed.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java,
line 2532
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45300#file45300line2532>
bq.  >
bq.  >     add another * here, so: /**
bq.  >     
bq.  >     that ensures this gets picked up as javadoc

Fixed.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java,
line 2558
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45300#file45300line2558>
bq.  >
bq.  >     this looks like a random debug statement, what does matchZK, sn: server mean?

Fixed.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java,
lines 62-69
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45309#file45309line62>
bq.  >
bq.  >     why this change?  should this be rolled into the ZKNodeTracker rather than overriding
the getData() behavior?

Fixed.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java,
line 90
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45309#file45309line90>
bq.  >
bq.  >     it seems like you're covering up for bugs in the underlying ZKNodeTracker...
can we fix that instead?  or if it's a matter of returning a cached value or not, can we just
add a boolean flag for refresh/nocache?

Fixed.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java,
line 294
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line294>
bq.  >
bq.  >     so we always verify the connection now?

Before the fix, all callers set it to "true". So there is no behavior change.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java,
line 368
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line368>
bq.  >
bq.  >     why do we have two hard-coded timeouts in this area of code? :)
bq.  >     
bq.  >     this code seems to always sleep 500ms at a time unless you set timeout=0 and
then it loops every 50ms?  that doesn't seem right... i could set timeout to 100ms and it
would sleep 500ms.  sleeping 50ms every time would be better but there's probably a solution
with less overhead (doing remote read queries every 50ms in a loop)
bq.  >     
bq.  >     could we just notifyAll() on metaAvailable whenever we relocate root?

For now the sleep interval is min(500ms, timeout), given that we might do more code cleanup around
RootRegionTracker later.
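
A minimal sketch of the min(500ms, timeout) idea, not the actual CatalogTracker code. The names BoundedWaitSketch, sleepSlice and waitFor are hypothetical, and the condition stands in for "the .META. location has been verified". Each sleep is capped by both 500 ms and the time remaining, so a 100 ms timeout no longer sleeps for 500 ms.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration; none of these names exist in HBase.
class BoundedWaitSketch {
    static final long MAX_SLICE_MS = 500;

    // Length of the next sleep: never longer than 500 ms or the time left.
    static long sleepSlice(long remainingMs) {
        return Math.min(MAX_SLICE_MS, remainingMs);
    }

    // Polls the condition until it holds or timeoutMs (assumed > 0) elapses.
    // Returns false on timeout or if the waiting thread is interrupted.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!condition.getAsBoolean()) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false; // less than 1 ms left: treat as timed out
            }
            try {
                Thread.sleep(sleepSlice(remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sleepSlice(100));
        System.out.println(sleepSlice(2000));
        System.out.println(waitFor(() -> false, 100));
    }
}
```

This still polls. The notifyAll() approach suggested in the review would avoid the polling, but it is outside the scope of this sketch.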


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java,
line 215
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45304#file45304line215>
bq.  >
bq.  >     i'm also a bit confused by this.  couldn't we just increase the thread pool
size to 2? :)

Added more explanation of the scenario to ServerShutdownHandler.java.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java,
line 2533
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45300#file45300line2533>
bq.  >
bq.  >     what about this method is specific to the shutdown server?  this seems specific
about regions in transition.  if we only use it in the context of servers being shut down
then maybe name it accordingly?  it does seem like a generally useful method though and just
related to ZK (could put it in a ZK util class?)

This method uses state from both ZK and AssignmentManager, so it seems better to keep it in AssignmentManager.
I kept the name, given that the method might be useful outside the shutdown scenario.


bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java,
line 144
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45306#file45306line144>
bq.  >
bq.  >     is this normal?  should it be a warn?  maybe a comment on why this would happen

This can happen when the RS shuts down. When the RS shuts down, setClosedState tries to transition
the node from CLOSING to CLOSED. That fails because the original state is OPENED rather than
CLOSING.

Normally, when AssignmentManager closes a region, it first sets the node to CLOSING
before making the RPC call to the RS. In that scenario, setClosedState succeeds.
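
The two cases above can be sketched as a compare-and-set on the node state. This is a simplified in-memory model, not the real ZK code: RegionNodeSketch, State and transition are hypothetical names, and an AtomicReference stands in for the unassigned znode.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical in-memory stand-in for a region's znode state.
class RegionNodeSketch {
    enum State { OPENED, CLOSING, CLOSED }

    private final AtomicReference<State> state;

    RegionNodeSketch(State initial) {
        this.state = new AtomicReference<>(initial);
    }

    // Succeeds only if the node is currently in the expected state.
    boolean transition(State expected, State next) {
        return state.compareAndSet(expected, next);
    }

    State current() {
        return state.get();
    }

    public static void main(String[] args) {
        // RS shutdown: nobody set CLOSING first, so CLOSING -> CLOSED fails.
        RegionNodeSketch rsShutdown = new RegionNodeSketch(State.OPENED);
        System.out.println(rsShutdown.transition(State.CLOSING, State.CLOSED));

        // Master-initiated close: the node is set to CLOSING before the RPC.
        RegionNodeSketch masterClose = new RegionNodeSketch(State.OPENED);
        masterClose.transition(State.OPENED, State.CLOSING);
        System.out.println(masterClose.transition(State.CLOSING, State.CLOSED));
    }
}
```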


- Ming


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2037
-----------------------------------------------------------


On 2011-09-24 01:50:02, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-24 01:50:02)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault
is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case -ROOT-
is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server
shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and
wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java
1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for
2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds,
etc. The program can run for a couple of hours until it stops. -ROOT- and .META. are available
during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.



> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for
2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds,
etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transition"
from AssignmentManager's point of view, but they aren't assigned to any region server. Here are the
issues.
> 1. The -ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check
if it contains the -ROOT- region. That is due to the long delay in ZK notification and the async nature
of the system. Here is an example: even though the new root region server sea-lab-1,60020,1316380133656
is set at T2, at T3, when the shutdown of sea-lab-1,60020,1316380133656 is processed, the root location
still points to the old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability
could be blocked. If, meanwhile, the new server that -ROOT- or .META. is being assigned to restarts,
another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler
worker threads are filled up. It looks like HBASE-4245.
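
Issue 2 can be modeled with a one-thread pool, as a rough sketch rather than the real handler code. HandlerPoolSketch and simulate are hypothetical names; the "first" task stands in for a handler that needs .META. to be available, and the "second" task stands in for the handler whose work makes it available. When the first task blocks, the second never gets a thread. When it requeues itself instead of blocking, both complete.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical model; these are not HBase classes.
class HandlerPoolSketch {

    // Returns true if the first handler finished its work within one second.
    static boolean simulate(boolean blockInHandler) {
        ExecutorService pool = Executors.newFixedThreadPool(1); // one worker
        CountDownLatch metaAvailable = new CountDownLatch(1);
        CountDownLatch firstHandlerDone = new CountDownLatch(1);

        Runnable[] first = new Runnable[1];
        first[0] = () -> {
            try {
                if (blockInHandler) {
                    metaAvailable.await(); // occupies the only worker thread
                } else if (metaAvailable.getCount() > 0) {
                    pool.execute(first[0]); // requeue and release the thread
                    return;
                }
                firstHandlerDone.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        // Second handler: its work is what makes .META. available again.
        Runnable second = metaAvailable::countDown;

        pool.execute(first[0]);
        pool.execute(second);
        try {
            return firstHandlerDone.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow(); // interrupts a blocked worker, if any
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking handler finished: " + simulate(true));
        System.out.println("requeueing handler finished: " + simulate(false));
    }
}
```

The requeue here is a busy retry kept short for illustration. The patch summary describes resubmitting another ServerShutdownHandler for regular regions, which follows the same "don't hold the worker thread" idea.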

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
