mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinod Kone <vinodk...@gmail.com>
Subject Re: Review Request 53897: Changed how master represents "recovered" frameworks.
Date Tue, 06 Dec 2016 02:39:06 GMT


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, line 7124
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7124>
> >
> >     CHECK_NOTNULL(framework);
> 
> Neil Conway wrote:
>     Is there a general rule for when to add `CHECK_NOTNULL` for internal functions that
take pointer parameters? `Master::failoverFramework`, `Master::_failoverFramework`, `Master::_exited(Framework*)`,
`Master::drop`, and `Master::authorizeTask` could all have `CHECK_NOTNULL`s added, for example.

i think we should do CHEcK_NOTNULL() for all methods that take raw pointers. the above ones
you cited seem to have missed them.


> On Dec. 3, 2016, 2:12 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 7132-7137
> > <https://reviews.apache.org/r/53897/diff/6/?file=1574956#file1574956line7132>
> >
> >     hmm. it's a bit weird that activating a framework also updates it. this should
be done by the callers before calling this method.
> 
> Neil Conway wrote:
>     Hmmm, why is this weird? To me it seems reasonable that when we activate a recovered
framework, we're given the `FrameworkInfo` that was presented by the framework on re-registration;
we use that `FrameworkInfo` to update whatever `FrameworkInfo` we previously had for the recovered
framework.
>     
>     If we move this outside of `activateRecoveredFramework`, we'd need identical code
in both of its call-sites -- that's not too bad, but I'm not sure why it is an improvement.

i meant the name "activateRecoveredFramework" doesn't seemt to suggest that it also "updates"
in addition to "activate"? usually they are done by different functions (e.g., Allocator::activateFramework(),
Allocator::updateFramework(), Allocator::activateSlave(), Allocator::updateSlave). when i
think of activating a recovered framework, my intuition is that it goes from RECOVERED to
ACTIVATED state in the master and maybe gets activated in the allocator (basically things
that need to be done to get a framework activated).

More importanly, i realized that previously a framework could update its FrameworkInfo willy
nilly after a master failover, but now it cannot because `updateFrameworkInfo` doesn't allow
updates to all fields. That's probably ok but maybe worth calling out?


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53897/#review157867
-----------------------------------------------------------


On Dec. 4, 2016, 2:06 a.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53897/
> -----------------------------------------------------------
> 
> (Updated Dec. 4, 2016, 2:06 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6419
>     https://issues.apache.org/jira/browse/MESOS-6419
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After master failover, the new master doesn't know which frameworks were
> registered with the previous master (because this information is not
> currently stored in the registry). In the period after the master fails
> over but before the framework scheduler has re-registered, the master
> learns about the frameworks in the cluster when agents re-register (an
> agent reports the FrameworkInfo for all of the frameworks it is running
> when it re-registers).
> 
> Such frameworks were previously represented separately from the normal
> list of frameworks in the master: the master kept a collection of
> `FrameworkInfo` for these "recovered" frameworks.
> 
> This commit removes this separate collection of "recovered" frameworks.
> Instead, the master now treats recovered frameworks very similarly to
> frameworks that are registered but currently disconnected. For example,
> recovered frameworks will now have a `Framework` object which tracks the
> tasks/executors running under that framework; recovered frameworks will
> be reported via the normal "frameworks" key when querying HTTP
> endpoints. Similarly, "teardown" operations on recovered frameworks will
> now work correctly (MESOS-6419).
> 
> This means there is no longer a concept of "orphan tasks" [1]: if the
> master knows about a task, the task will be running under a framework
> (albeit the framework might be recovered or disconnected). A new
> "recovered" key has been added to various HTTP endpoints/APIs to
> determine if a framework hasn't yet re-registered after master failover.
> 
> [1] The exception here is if the cluster contains Mesos agents older
> than 1.0, because old Mesos agents don't report `FrameworkInfo`s when
> they re-register.
> 
> 
> Diffs
> -----
> 
>   include/mesos/master/master.proto 966105cf14bf92ad15a38653cd5c7f3322821289 
>   include/mesos/v1/master/master.proto 0e5a8674a76a46c6f2cc17e11cd735f756d94fd1 
>   src/master/http.cpp ac560d1fdd219d0de0c5d987a32a7112e149602f 
>   src/master/master.hpp b444b2352360fb4f7179acd97dffc0cd81cc7afa 
>   src/master/master.cpp b0670d993348d189fafff0f83f9da0c5b18d1c51 
>   src/tests/api_tests.cpp afae6a7e0809174f48f280f170fad9315e80a906 
>   src/tests/fault_tolerance_tests.cpp 59f68f65fac9fca4a6941793b712bfe7bb30c880 
>   src/tests/master_allocator_tests.cpp bb94e38d5bb472801366c172cfc036f2eecdcbcb 
>   src/tests/master_authorization_tests.cpp 06c6e47250003cd467bf37e1daa8bd636ef8fef5 
>   src/tests/master_tests.cpp dfedbbdf78e8054813872e9eeebccc7504097751 
>   src/tests/partition_tests.cpp 5a0d4bd2de6a5aa0e9fdf0d34cd10d16fd4e34a1 
>   src/tests/teardown_tests.cpp 125e1808e51da53fdf1bb1babc956ff062cbb102 
> 
> Diff: https://reviews.apache.org/r/53897/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> 
> Thanks,
> 
> Neil Conway
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message