hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "HBase Review Board (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3181) Review, document, and fix up Regions-in-Transition timeout logic
Date Tue, 02 Nov 2010 02:04:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927251#action_12927251
] 

HBase Review Board commented on HBASE-3181:
-------------------------------------------

Message from: "Jonathan Gray" <jgray@apache.org>


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > Commit after reviewing the below.  This patch makes a HUGE difference.  My trashing
rolling restart of a 10 node cluster of 3k regions now works (mostly).  Seems like an errant
region the odd time I test but thats for hbck to figure now....  Great work JGray.

Okay, will fix up minor stuff below and commit.  Thanks for review and all the testing!


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java, line 501
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16357#file16357line501>
bq.  >
bq.  >     How did this ever work?  Why didn't typing catch this?

This code never did work right.  This is all new master code.  Also, the HSI returning method
I think in MetaReader is new from after I did fixes related to TestMasterFailover.

It would only be an issue if an RS died and came back up, and the new instance of it was assigned
regions before the shutdown handling of the previous instance even started.

That's the killer part of the test you've been doing and this is definitely a key part of
this fix.


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
111
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line111>
bq.  >
bq.  >     We doing this or not?  Seems like we are.  Should we change this back to plain
HashMap... else when concurrent map it gives impression that its ok to meddle in regionPlan
w/o first synchronizing?

we mostly do but not always.  let's open another jira for this because there's this AssignmentManager.updateTimers()
call that "tickles" stuff that maybe should not hold a lock?

Opened HBASE-3188 (did not target to 0.90 but we can certainly do now)


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
644
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line644>
bq.  >
bq.  >     want to doc this new param? (Not important -- its self-describing)

adding javadoc


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
841
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line841>
bq.  >
bq.  >     We don't have to do the remove anymore?  Passing force new plan does this for
us?

Yeah.  Javadoc states that if true and plan exists, will create a new one.


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
935
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line935>
bq.  >
bq.  >     Should this test go into a method of its own?  Is it repeated elsewhere?

No, that's only place.  If method, would take three args so probably not much simpler.


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
998
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line998>
bq.  >
bq.  >     What happens if state.isClosing now? (You dropped it from else if here)

We don't force out of isClosing state this way anymore.  In new timeout handling, we will
only send another close rpc if the RS never transitions to CLOSING.  Once the RS transitions
to CLOSING (state.isClosing()) it does one of three things: it finishes (goes to CLOSED),
dies (RS expire), or fails (deletes node).  We handle those accordingly.


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
1012
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1012>
bq.  >
bq.  >     Are all this.regions synchronized?  Same for this.regionPlans?  You might want
to give it all a review before commit?

regions looked fine when i did a review of it and RIT.  regionPlans needs review.  HBASE-3188


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
1027
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1027>
bq.  >
bq.  >     I think spell it out rather than NSRE in the message.

done


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
1047
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1047>
bq.  >
bq.  >     Is this ok?  Calling assign inside a synchronize on RIT?  Is that done everywhere?
 Or, this is the assign that takes a RegionState.. and expects the synchronize to be in place?

It seems okay except looking at it more, we end up doing a bit of work under RIT lock that
we don't need to.  And all locking needed is done within assign().  Walked through it three
times, we should move assign outside RIT lock.

Will do that.  Need to keep RIT lock around forceRegionStateToOffline() though


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
1549
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1549>
bq.  >
bq.  >     Nothing to do here?  Job for hbck?

It's hard to imagine how it would happen, so must be hbck I guess.  hbck repair would wipe
all ZK state, ensure region is closed everywhere, then do assign w/ set zk offline=true


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line
1569
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16358#file16358line1569>
bq.  >
bq.  >     What happens here?  I suppose we used shutdown server?

Changing to "Node did not exist, can't time this out".  I guess this is hbck.

I'm trying not to capture every odd-ball case that "should be" impossible (or when possible,
they should be ignored).

So we could see these things happening, when master gets really overloaded and backed up,
but when this is the case, we should ignore these states as events will be triggered down
the road to handle things accordingly.


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java, line 584
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16360#file16360line584>
bq.  >
bq.  >     Thats an awkward sentence for a debug message...

rewritten as english.


bq.  On 2010-11-01 14:48:51, stack wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java,
line 66
bq.  > <http://review.cloudera.org/r/1143/diff/5/?file=16361#file16361line66>
bq.  >
bq.  >     We were not using this?  Probably for the best.  Probably stale by time we went
to act on it?

it's implied now.  this is historical from first implementation of the handler stuff.  we
used to have one handler for multiple states, but now CLOSED has it's own handler so we don't
need the state anymore.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1143/#review1755
-----------------------------------------------------------





> Review, document, and fix up Regions-in-Transition timeout logic
> ----------------------------------------------------------------
>
>                 Key: HBASE-3181
>                 URL: https://issues.apache.org/jira/browse/HBASE-3181
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver, zookeeper
>    Affects Versions: 0.90.0
>            Reporter: Jonathan Gray
>            Assignee: Jonathan Gray
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> In some of the testing Stack and I have been doing, we've uncovered some issues with
concurrent RS failure and when the Master is under heavy load.  It's led to situations where
we handle ZK events far after they actually occur and have uncovered some issues in our timeout
logic.
> This jira is about reviewing the timeout semantics, especially around ZK usage, and ensuring
that we handle things appropriately.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message