hbase-dev mailing list archives

From Sergey Shelukhin <ser...@hortonworks.com>
Subject Re: [hbase-5487] Master design discussion.
Date Tue, 22 Oct 2013 23:56:39 GMT
I'll try to update the doc tomorrow before the meetup..

> I suggest we start a thread about the hang/double assign/fencing scenario.
Fencing is already covered by WAL fencing. After WAL has been fenced, RS
cannot write even if it un-hangs, so I think it's not a problem.

> Can you point me to where I read and consider the wal fencing?  Do you have
> thoughts or a comparison of these approaches?  Are there other techniques?
>...
>... We need to guarantee that the old master and new master are fenced off
>from each other to make sure they don't split brain.
Do you mean WAL fencing for the RS or the master? I meant the master fencing
off the RSes. That is already implemented; this is what we do now.

http://www.slideshare.net/enissoz/hbase-and-hdfs-understanding-filesystem-usage
slides 20-23 have the description.

The master can also fence off the previous master, I guess. Actually I like
that better than a ZK lease. If we fence off the WAL of the previous master,
then we know it cannot write - assuming we use the system table, of course.
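
For concreteness, here is a rough sketch of the kind of WAL fencing I mean
(illustrative only - the real code lives in the master's log-split path; I'm
assuming an HDFS DistributedFileSystem underneath, and the class/method names
are made up):

    // Illustrative sketch: fence a (possibly hung) writer by recovering the
    // HDFS lease on its WAL files before replaying them or reassigning regions.
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class WalFencingSketch {
      // After this returns, the old writer's lease is revoked, so a hung RS
      // (or old master) that wakes up later cannot append to these WAL files.
      static void fenceWals(DistributedFileSystem dfs, Path walDir)
          throws Exception {
        for (FileStatus wal : dfs.listStatus(walDir)) {
          boolean recovered = false;
          while (!recovered) {
            // recoverLease() forcibly closes the file on behalf of the old writer.
            recovered = dfs.recoverLease(wal.getPath());
            if (!recovered) {
              Thread.sleep(1000); // wait for the NN to finish block recovery
            }
          }
        }
        // Only once all WALs are fenced is it safe to split/replay them and
        // reassign the regions elsewhere.
      }
    }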

For the master - do you think the state management/transitions should be
resilient to double-assignment of the system region (two active masters)? It
seems simpler to handle that outside the state machine. A ZK lease could be
one option, but yeah, neither master5 nor part1 covers master-master conflict.
Do you think they should?



>> When RS contacts master to do state update, it errs on the side of caution
>> - no state update, no open region (or split).
> Not sure what "it" in the second part of the sentence means here -- is this
> the master or the RS?  If it is the master, I think this is what hbck-master
> calls an "actual state update".
> If "it" means the RS's, I'm confused.
RS. The RS completes all the region open work, where onlining the region is a
boolean switch (putting it in the map or such), then updates the state via the
master. If it does update the state, the master knows about it. If it cannot,
it doesn't finish onlining the region and closes it instead.
That way we are sure the region is not opened unless we have received the
opened message at some point.
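
A minimal sketch of the ordering I have in mind (names here are made up, not
actual HBase classes):

    // Hypothetical sketch of the RS side: the region becomes visible to
    // clients only if the state update through the master succeeded.
    import java.io.IOException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class RegionOpenSketch {
      interface MasterStateClient {
        // false (or an exception) means the master did not record OPENED
        boolean reportOpened(String regionName) throws IOException;
      }

      private final ConcurrentMap<String, Object> onlineRegions =
          new ConcurrentHashMap<>();

      void openRegion(String regionName, Object region, MasterStateClient master) {
        // 1. All the actual open work (replaying edits, etc.) happens before
        //    this point; the region is not yet in the online map, so no
        //    client can reach it.
        boolean reported;
        try {
          reported = master.reportOpened(regionName);
        } catch (IOException e) {
          reported = false;
        }
        if (reported) {
          // 2. Onlining is just a boolean switch: put the region in the map.
          onlineRegions.put(regionName, region);
        } else {
          // 3. Err on the side of caution: no state update, no open region.
          closeQuietly(region);
        }
      }

      private void closeQuietly(Object region) {
        // release memstore, file handles, etc.
      }
    }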


>I don't think that is a safe assumption -- I still think double assignment
>can and will happen.  If an RS hangs long enough for the master to think it
>is dead, the master would reassign.  If that RS came back, it could still
>be open handling clients who cached the old location.
Not for writes (see fencing). For reads, it's nigh impossible to prevent while
keeping reasonable MTTR and performance.


> I'm suggesting there are multiple versions of truth since we have multiple
> actors.  What the master thinks and knows is one kind (hbck-master:
> currentState), what the master wants is another (hbck-master: intendedState),
> and what the regionservers think is another (hbck-master: actualState).  The
> master doesn't know exactly what is really going on (it could be partitioned
> from some of the RS's that are still actively serving clients!).
> Rephrase of question:  "What do clients look at, and who modifies the truth?".
Clients look at the system table (via the master, presumably). The single
point of truth, in my view, holds the sum total of state pertaining to the
actions on the region. It is modified by both RS and master in an orderly,
atomic (CAS-like) way, so region operations do not need to take into account
what other actors are doing - e.g. they don't need to check other nodes in ZK,
separate in-memory state, etc.; ditto for recovery. This covers the current
and planned "logical state" from hbck-master. As for actual state, I am not
sure it needs to be covered; it is ephemeral and is dealt with based on
notifications/responses from RSes, etc. It is covered in the operation
descriptions, I guess.
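
To make the CAS-like part concrete, something along these lines (the API
shape and the table/column names here are placeholders, not the actual
schema):

    // Illustrative only: both master and RS drive a region's state through a
    // single compare-and-set on its system-table row, so neither needs to
    // consult ZK nodes or separate in-memory state.
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RegionStateCasSketch {
      static final byte[] CF = Bytes.toBytes("info");
      static final byte[] STATE = Bytes.toBytes("state");

      // Returns true iff we were the actor that moved the region from
      // expectedState to newState; a concurrent actor's CAS fails here.
      static boolean casState(Connection conn, byte[] regionRow,
                              String expectedState, String newState)
          throws Exception {
        try (Table systemTable =
            conn.getTable(TableName.valueOf("hbase:regionstate"))) {
          Put put = new Put(regionRow).addColumn(CF, STATE, Bytes.toBytes(newState));
          return systemTable.checkAndPut(regionRow, CF, STATE,
              Bytes.toBytes(expectedState), put);
        }
      }
    }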


>> As for any case of alter on enabled table, there are no guarantees for
>>clients. To provide these w/o disable/enable (or logical equivalent of
>>coordinating all close-s and open-s), one would need some form of
>>version-time-travel, or waiting for versions, or both.

> By version time travel you mean something like this right?
> [snip]
> I think we'd want to wait instead of allowing version time-travel and pass
> version info to the client that it needs to pass to RS requests.  This would
> prevent version-time-travel which could potentially lead to data hazard
> situations.
> (or maybe online alter should not be allowed to add/remove cfs -- only modify
> their settings).
Actually that is what I meant by "version time travel" :) The client reads
previous versions of the regions. That is possible, but again orthogonal to
state or version management.
It is also not so simple to do - it's distributed MVCC, essentially. Plus, I
am not 100% sure of all the ways the descriptor can affect the client, and
what the support in HRegion will look like - will it be as simple as keeping
several descriptors by version, or will there be larger changes?
It is compatible with the proposed update scheme, anyway.
The current model for online schema changes is that the client just reads
whatever is current, right?
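
If we did go the "wait / pass the version" route you describe, the RS-side
check might look roughly like this (purely hypothetical, not existing
behavior; names are made up):

    // Hypothetical sketch of threading the table-descriptor version through
    // client requests: the RS rejects a request carrying a newer version than
    // the one it is serving, so the client never silently "time-travels" back
    // to an older schema.
    public class SchemaVersionCheckSketch {
      static class StaleRegionSchemaException extends RuntimeException {
        StaleRegionSchemaException(String msg) { super(msg); }
      }

      static void checkSchemaVersion(long clientSeenVersion,
                                     long regionDescriptorVersion) {
        if (clientSeenVersion > regionDescriptorVersion) {
          // The client has already observed a newer descriptor than this
          // region is serving; force a retry (or wait) instead of serving the
          // request against the older schema.
          throw new StaleRegionSchemaException("region is at descriptor v"
              + regionDescriptorVersion + ", client has already seen v"
              + clientSeenVersion);
        }
      }
    }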

> I think that user intent and system intent can probably be placed in the
> same place with just a flag to differentiate (that probably gives user
> intent priority).
There are two different levels of intent - the target of the already-started
operation, and the next operation; e.g. the region is opening now and then the
user wants to merge it. Unless I misunderstand target states, they would only
be able to store one - either "Opened" or "Merging", right?
IMHO you need to store both things: opening now, merge afterwards (it may not
be possible to merge before opening). Or will the master reconstruct the path
to the final intent?
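
Roughly, the shape I have in mind for the record (field names hypothetical):

    // Hypothetical shape of the region record: the target of the operation
    // already in flight and the next user intent are kept separately, rather
    // than collapsed into one "intended state" field.
    public class RegionRecordSketch {
      enum State { OPENING, OPENED, CLOSING, CLOSED, SPLITTING, MERGING }

      static class RegionRecord {
        State currentState;       // e.g. OPENING
        State currentOpTarget;    // target of the in-flight op, e.g. OPENED
        String pendingUserIntent; // e.g. "merge with region R2"; acted on only
                                  // after currentOpTarget is reached (or the
                                  // in-flight operation is canceled)
      }
    }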

> This is the hang/double assign scenario again.
Covered above.

> There are properties that lead master5 to consider the two to be the same
> state -- what are these invariants?  I felt they were different because
> they do a different set of IO operations.
Basically, a region that is splitting/merging is still opened and can be read.
From the client's perspective it's opened; from the master's perspective, it
can do the same set of ops on it after canceling the split/merge activity.
E.g. the master can move a splitting region to closing just as well as an
opened one, and the split will then fail to update the state at the end and be
abandoned. Splitting/merging is more of a notification that something is
cooking for the opened region (or the closed one, in the case of some of the
merging regions).


> Let me rephrase as a few scenarios:
> * what happens if multiple operations are issued to a specific region
>concurrently (lets say split and merge involving the split target, or
>maybe something simpler) ? If so how does master5 resolve this?
A split will change the state of the region to splitting and create the
daughters; a merge will change the state of the regions to merging and create
the daughter. Given CAS semantics (MCAS actually, using multi-row tx-es), one
of the state changes will fail (at the low level). Then the master might
decide to cancel the other operation (say, the RS is splitting and the user
wants to merge) and change the state of the region accordingly. When the split
completes and is ready to commit, it will attempt to CAS the splitting state
to split (and update the daughters), fail, and abandon its work without
actually doing the split.
When canceling a split, a message to abandon the work earlier can be sent as
an optimization.
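
Building on the earlier casState() sketch, the race plays out roughly like
this (illustrative only; a real implementation would use a multi-row
transaction so the daughter rows are created in the same atomic step):

    // Illustrative sketch of the race: split and merge both try to CAS the
    // same region row, so at most one of them wins at the system-table level.
    public class SplitMergeRaceSketch {
      static void example(org.apache.hadoop.hbase.client.Connection conn,
                          byte[] regionRow) throws Exception {
        // RS-driven split tries OPENED -> SPLITTING
        boolean splitWon =
            RegionStateCasSketch.casState(conn, regionRow, "OPENED", "SPLITTING");
        // Master-driven merge tries OPENED -> MERGING
        boolean mergeWon =
            RegionStateCasSketch.casState(conn, regionRow, "OPENED", "MERGING");
        // At most one of the two is true; the loser sees a failed CAS and
        // abandons its work (or the master cancels the winner and retries).
        assert !(splitWon && mergeWon);
        // Later, when the split tries to commit SPLITTING -> SPLIT, it fails
        // the same way if the master has since moved the region to CLOSING.
      }
    }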

> * table disable is issued concurrently with a region split.  what are the
> possible outcomes? what should happen?
Let me cover this in detail in the doc... Currently it depends on timing: the
split may complete, and then, when the new regions are to be opened (or are
opened), the master will notice the table is disabling and close them. Or the
split may be canceled in a similar manner as described above, by making sure
the RS cannot make the state change that commits it.



>> if opening (or closing) times out, master will fence off RS and mark region
>> as closed. If there was some way of fencing the region separately (ZK lease?)
>> it would be possible to use that.
> Interesting.. why not just let it finish the open on the RS but not let the
> client see that the region is assigned there, and then close? (this is why I'm
> trying to separate what clients see from what the master sees)
For opening - why wait? :) Actually, for the disabled case, timing out the
open is not even necessary; we can simply ignore it as long as the RS cannot
update to Opened, so we can move the region right to Closed.
For closing - because it timed out, we cannot be sure what is going on; we
have to fence off the RS, because the region may remain opened for writes.

> I'm thinking we can stop the writes to all regions of a table on an RS once
> the RS knows the table is disabling.  This could be done by the version
> number threading idea or maybe some other fencing mechanism.
That would narrow the window, but not eliminate it; so it should be about as
good as the opportunistic "close" message. We could narrow the window a little
bit more by sending an additional "stop writing" message, but I am not sure
how valuable that is.
Or do you mean using some protocol to make sure all region servers stop writes
before disabling?



On Tue, Oct 22, 2013 at 2:40 PM, Sergey Shelukhin <sergey@hortonworks.com> wrote:

> Sorry, I was busy, let me respond to Jonathan's comment now.
> Lars: both mine and Jonathan's spec try to address it as one of the main
> things.
> Both specs are similar actually.
>
>
> On Sat, Oct 19, 2013 at 3:10 PM, Lars Hofhansl <lhofhansl@yahoo.com> wrote:
>
>> Didn't read the spec, yet.
>> My main gripe currently is that there are too many places holding the
>> state: the fs, .meta., zk, copies in ram at both master and region servers,
>> etc. Some of the problems we've seen were due to the various copy of this
>> state getting out of sync.
>>
>> Now I'll shut up and read the spec.
>>
>> -- Lars
>>
>>
>> Jonathan Hsieh <jon@cloudera.com> wrote:
>>
>> >Responses inline.   I'm going to wait for next doc before I ask more
>> >questions about the master5 specifics. :)
>> >
>> >I suggest we start a thread about the hang/double assign/fencing
>> scenario.
>> >
>> >Jon.
>> >
>> >On Sat, Oct 19, 2013 at 7:17 AM, Jonathan Hsieh <jon@cloudera.com>
>> wrote:
>> >
>> >> Here's sergey's replies from jira (modulo some reformatting for email.)
>> >>
>> >> ----
>> >> Answers lifted from email also (some fixes + one answer was modified
>> due
>> >> to clarification here ).
>> >>
>> >> What is a failure and how do you react to failures? I think the master5
>> >>> design needs to spend more effort to considering failure and recovery
>> >>> cases. I claim there are 4 types of responses from a networked IO
>> operation
>> >>> - two states we normally deal with ack successful, ack failed (nack)
>> and
>> >>> unknown due to timeout that succeeded (timeout success) and unknown
>> due to
>> >>> timeout that failed (timeout failed). We have historically missed the
>> last
>> >>> two cases and they aren't considered in the master5 design.
>> >>
>> >>
>> >> There are a few considerations. Let me examine if there are other cases
>> >> than these.
>> >> I am assuming the collocated table, which should reduce such cases for
>> >> state (probably, if collocated table cannot be written reliably, master
>> >> must stop-the-world and fail over).
>> >>
>> >
>> >For the master, I agree.  In the case of nack it knows it failed and
>> should
>> >failover, in the case of timeouts it should try verify and retry and if
>> it
>> >cannot it should abdicate.  This seems equivalent to the 9x-master's "if
>> we
>> >can't get to zk we abort" behavior.  We need to guarantee that the old
>> >master and new master are fenced off from each other to make sure they
>> >don't split brain.
>> >
>> >Off list, you suggested wal fencing and hbck-master sketched out a lease
>> >timeouts mechanism that provides isolation. Correctness in the face of
>> >hangs and partitions are my biggest concern, and a lot of the questions
>> I'm
>> >asking focus on these scenarios.
>> >
>> >Can you point me to where I read and consider the wal fencing?  Do you
>> have
>> >thoughts or a comparison of these approaches?  Are there others
>> techniques?
>> >
>> >
>> >> When RS contacts master to do state update, it errs on the side of
>> caution
>> >> - no state update, no open region (or split).
>> >>
>> >
>> >not sure what "it" in the second part of the sentence means here -- is
>> this
>> >the master or the rs?  If it is the master I this is what hbck-master
>> calls
>> >an "actual state update".
>> >
>> >If "it" mean RS's, I'm confused.
>> >
>> >
>> >> Thus, except for the case of multiple masters running, we can always
>> >> assume RS didn't online the region if we don't know about it.
>> >>
>> >
>> >I don't think that is a safe assumption -- I still think double
>> assignment
>> >can and will happen.  If an RS hangs long enough for the master to think
>> it
>> >is dead, the master would reassign.  If that RS came back, it could still
>> >be open handling clients who cached the old location.
>> >
>> > Then, for messages to RS, see "Note on messages"; they are idempotent so
>> >> they can always be resent.
>> >>
>> >>
>> >> 1) State update coordination. What is a "state updates from the
>> outside"
>> >>> Do RS's initiate splitting on their own? Maybe a picture would help
>> so we
>> >>> can figure out if it is similar or different from hbck-master's?
>> >>
>> >>
>> >> Yes, these are RS messages. They are mentioned in some operation
>> >> descriptions in part 2 - opening->opened, closing->closed; splitting,
>> etc.
>> >>
>> >> [will revisit when part 2 posted]
>> >
>> >
>> >> 2) Single point of truth. hbck-master tries to define what single
>> point of
>> >>> truth means by defining intended, current, and actual state data with
>> >>> durability properties on each kind. What do clients look at who
>> modifies
>> >>> what?
>> >>
>> >>
>> >> Sorry, don't understand the question. I mean single source of truth
>> mainly
>> >> about what is going on with the region; it is described in design
>> >> considerations.
>> >> I like the idea of "intended state", however without more detailed
>> reading
>> >> I am not sure how it works for multiple ops e.g. master recovering the
>> >> region while the user intends to split it, so the split should be
>> executed
>> >> after it's opened.
>> >>
>> >> I'm suggesting there are multiple version of truth since we have
>> multiple
>> >actors.  What the master thinks and knows is one kind (hbck-master:
>> >currentState), what the master wants is another (hbck-master:
>> >intendedState), and what the regionservers think are another (hbck-mater:
>> >actualState).  The master doesn't know exactly what is really going on
>> (it
>> >could be partitioned from some of the RS's that are still actively
>> serving
>> >clients!).
>> >
>> >Rephrase of question:  "What do clients look at, and who modifies the
>> >truth?".
>> >
>> >
>> >>
>> >> 3) Table record: "if regions is out of date, it should be closed and
>> >>> reopened". It is not clear in master5 how regionservers find out that
>> they
>> >>> are out of date. Moreover, how do clients talking to those RS's with
>> stale
>> >>> versions know they are going to the correct RS especially in the face
>> of RS
>> >>> failures due to timeout?
>> >>
>> >>
>> >> On alter (and startup if failed), master tries to reopen all regions
>> that
>> >> are out of date.
>> >> Regions that are not opened with either pick up the new version when
>> they
>> >> are opened, or (e.g. if they are now Opening with old version) master
>> >> discovers they are out of date when they are transitioned to Opened by
>> RS,
>> >> and reopens them again.
>> >>
>> >> I buy this part of the versioning scheme.  (though this another place
>> >succeptable to the hang/double assign scenario I described a few
>> questions
>> >earlier).
>> >
>> >
>> >> As for any case of alter on enabled table, there are no guarantees for
>> >> clients.
>> >>
>> >To provide these w/o disable/enable (or logical equivalent of
>> coordinating
>> >> all close-s and open-s), one would need some form of
>> version-time-travel,
>> >> or waiting for versions, or both.
>> >>
>> >
>> >By version time travel you mean something like this right?
>> >client gets from region R1 with version 1.
>> >client gets from region R2 with version 1.
>> >alter initiated.
>> >R1 updated to v2.
>> >client gets from R1 with version 2.
>> >client gets from R2 with version 1 (version time travel here).
>> >R2 updated to V2.
>> >
>> >I think we'd want to wait instead of allowing version time-travel and
>> pass
>> >version info to the client that it needs to pass to RS requests.  This
>> >would prevent version-time-travel which could potentially lead to data
>> >hazard situations.  (or maybe online alter should not be allowed to
>> >add/remove cfs -- only modify their settings).
>> >
>> >
>> >
>> >> 4) region record: transition states. This is really similar to
>> >>> hbck-masters current state and intended state. Shouldn't be defined
>> as part
>> >>> of the region record?
>> >>
>> >>
>> >> I mention somewhere that could be done. One thing is that if several
>> paths
>> >> are possible between states, it's useful to know which is taken.
>> >> But do note that I store user intent separately from what is currently
>> >> going on, so they are not exactly similar as far as I see.
>> >>
>> >> [I'll wait for the next master5 doc to consider the paths you are
>> mention.]
>> >
>> >I think that user intent and system intent can probably be place in the
>> >same place with just a flag to differentiate (that probably gives user
>> >intent priority).
>> >
>> >
>> >>
>> >> 5) Note on user operations: the forgetting thing is scary to me – in
>> your
>> >>> move split example, what happens if an RS reads state that is
>> forgotten?
>> >>
>> >>
>> >> I think my description of this might be too vague. State is not
>> forgotten;
>> >> previous intent is forgotten. I.e. if user does several operations in
>> order
>> >> that conflict (e.g. split and then merge), the first one will be
>> canceled
>> >> (safely ).
>> >> Also, RS does not read state as a guideline to what needs to be done.
>> >>
>> >> Got it.  I think this is similar in hbck-master.  In hbck-master
>> parlance
>> >-- the intended state may be updated multiple times by the user.  Instead
>> >of canceling however, hbck-master would figure out how to recover to the
>> >latest intended state.  How this is done in hbck-master is generic but
>> its
>> >a little fuzzy on the details currently (where is it a FSM and where is
>> it
>> >a push-down automata (PDA))..
>> >
>> >
>> >> 6) table state machine. how do we guarantee clients are writing from
>> the
>> >>> correct version in the in failures?
>> >>
>> >>
>> >> Can you please elaborate?
>> >>
>> >> This is the hang/double assign scenario again.
>> >
>> >
>> >>
>> >> 7) region state machine. Earlier draft hand splitting and merge cases.
>> Are
>> >>> they elided in master5 or are not present any more. How would this get
>> >>> extended handle jeffrey's distributed log replay/fast write recovery
>> >>> feature?
>> >>
>> >>
>> >> As I mention somewhere these could be separate states. I was kind of
>> >> afraid of blowing up state machine too much, so I noticed that for
>> >> split/merge you anyway store siblings/children, so you can recognize
>> them
>> >> and for most purposes different split-merge states are the same as
>> Opened
>> >> and Closed.
>> >>
>> >> I will add those back, it would make sense.
>> >>
>> >> There are properties that lead master5 consider the two to be the same
>> >state -- what are these invariants?  I felt they were different because
>> >they do a different set of IO operations.
>> >
>> >
>> >>  8) logical interactions: sounds like master5 allows concurrent region
>> and
>> >>> table operations. hbck-master (though not fully documented) only
>> allows
>> >>> certain region transitions when the table is enabled or if the table
>> is
>> >>> disabled. Are we sure we don't get into race conditions? What happens
>> if
>> >>> disable gets issued – its possible for someone to reopens the region
>> and
>> >>> for old clients to continue writing to it even though it is closed?
>> >>
>> >>
>> >> Yes, parallelism is intended. You can never be sure you have no races
>> but
>> >> we should aim for it
>> >>
>> >>
>> >You can prove that you won't be affected by races and can eliminate cases
>> >where you are.
>> >
>> >Let me rephrase as a few scenarios:
>> >* what happens if multiple operations are issued to a specific region
>> >concurrently (lets say split and merge involving the split target, or
>> maybe
>> >something simpler) ? If so how does master5 resolve this?
>> >* table disable is issued concurrently with a region split.  what are the
>> >possible out comes? what should happen?
>> >
>> >master5 is missing disabled/enabled check, that is a mistake.
>> >>
>> >> Part1 operation interactions already cover it:
>> >>
>> >> table disable doesn't ack until all regions are closed (master5 is
>> wrong ).
>> >>
>> >
>> >[ah, the shorter doc is updated from master5.. I'll wait for the new one]
>> >
>> >region opening cannot start if table is already disabling or disabled.
>> >>
>> >good
>> >
>> >if region is already opening when disable is issued, opening will be
>> >> opportunistically canceled.
>> >>
>> >if disable fails to cancel opening, or server opens it first in a race,
>> >> region will be opened, and master will issue close immediately after
>> state
>> >> update. Given that region is not closed, disable is not complete.
>> >> if opening (or closing) times out, master will fence off RS and mark
>> >> region as closed. If there was some way of fencing region separately
>> (ZK
>> >> lease?) it would be possible to use that.
>> >>
>> >> Interesting.. why not just let it finish the open on the RS but not let
>> >the client see that the region assigned there, and then close? (this is
>> why
>> >I'm trying to separate what client's see from what the master sees)
>> >
>> >I think hbck-master would just update intended state and then separately
>> >"recover" the problem if the region showed up open in actual state.
>> >
>> >In any case, until client checks table state before every write, there's
>> no
>> >> easy way to prevent writes on disabling table. Writes on disabled table
>> >> will not be possible.
>> >>
>> >> I'm thinking we can stop the writes to all regions of a table on an RS
>> >once the RS knows the table is disabling.  This could be done by the
>> >version number threading idea or maybe some other fencing mechanism.
>> >
>> >
>> >> On ensuring there's no double assignment due to RS hanging:
>> >> The intent is to fence the WAL for region server, the way we do now.
>> One
>> >> could also use other mechanism.
>> >>
>> >
>> >[I need a pointer to this]
>> >
>> >
>> >> Perhaps I could specify it more clearly; I think the problem of making
>> >> sure RS is dead is nearly orthogonal.
>> >> In my model, due to how opening region is committed to opened, we can
>> only
>> >> be unsure when the region is in Opened state (or similar states such as
>> >> Splitting which are not present in my current version, but will be
>> added).
>> >> In that case, in absence of normal transition, we cannot do literally
>> >> anything with the region unless we are sufficiently sure that RS is
>> >> sufficiently dead (e.g. cannot write).
>> >> So, while we ensure that RS is dead we don't reassign.
>> >> My document implies (but doesn't elaborate, I'll fix that) that master
>> >> does direct Opened->Closed direct transition only when that is true.
>> >> A state called "MaybeOpened" could be added. Let me add it...
>> >>
>> >> looking forward to the update.
>> >
>> >
>> >> --
>> >> // Jonathan Hsieh (shay)
>> >> // Software Engineer, Cloudera
>> >> // jon@cloudera.com
>> >>
>> >>
>> >
>> >
>> >
>> >--
>> >// Jonathan Hsieh (shay)
>> >// Software Engineer, Cloudera
>> >// jon@cloudera.com
>>
>
>

