mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Bannier <benjamin.bann...@mesosphere.io>
Subject Re: Review Request 55571: Changed Master::Framework::updateFrameworkInfo so it can return errors.
Date Wed, 18 Jan 2017 19:07:13 GMT


> On Jan. 17, 2017, 10:07 a.m., Jay Guo wrote:
> > src/master/master.cpp, lines 7201-7203
> > <https://reviews.apache.org/r/55571/diff/1/?file=1605757#file1605757line7201>
> >
> >     There may be some corner cases that causes this `CHECK` to coredump. For example:
> >     1. Start a Mesos cluster with master, agent and a scheduler with single role
`role1`.
> >     2. Master failover
> >     3. Agent re-register before scheduler does
> >     4. Scheduler re-subscribe with multi-role `{role2}`
> >     
> >     In this case, master re-construct FrameworkInfo from re-registered agent, which
contains old role. However, in current implementation, `Error` of `updateFrameworkInfo` is
not caught for recovered frameworks.
> >     
> >     In theory, this test case shouldn't cause master coredump:
> >     ```cpp
> >       master::Flags masterFlags = CreateMasterFlags();
> >       Try<Owned<cluster::Master>> master = StartMaster(masterFlags);
> >       ASSERT_SOME(master);
> >     
> >       MockExecutor exec(DEFAULT_EXECUTOR_ID);
> >       TestContainerizer containerizer(&exec);
> >     
> >       StandaloneMasterDetector slaveDetector(master.get()->pid);
> >       Try<Owned<cluster::Slave>> slave = StartSlave(&slaveDetector,
&containerizer);
> >       ASSERT_SOME(slave);
> >     
> >       FrameworkInfo frameworkInfo = DEFAULT_FRAMEWORK_INFO;
> >       frameworkInfo.set_failover_timeout(1024);
> >       frameworkInfo.set_role("role1");
> >     
> >       Future<FrameworkID> frameworkId;
> >     
> >       {
> >         MockScheduler sched;
> >         MesosSchedulerDriver driver(
> >             &sched, frameworkInfo, master.get()->pid, DEFAULT_CREDENTIAL);
> >     
> >         EXPECT_CALL(sched, registered(&driver, _, _))
> >           .WillOnce(FutureArg<1>(&frameworkId));
> >     
> >         Future<vector<Offer>> offers;
> >         EXPECT_CALL(sched, resourceOffers(&driver, _))
> >           .WillOnce(FutureArg<1>(&offers))
> >           .WillRepeatedly(Return()); // Ignore subsequent offers.
> >     
> >         driver.start();
> >     
> >         Clock::pause();
> >         Clock::settle();
> >         Clock::resume();
> >     
> >         AWAIT_READY(frameworkId);
> >     
> >         AWAIT_READY(offers);
> >         EXPECT_FALSE(offers->empty());
> >     
> >         TaskInfo task;
> >         task.set_name("");
> >         task.mutable_task_id()->set_value("1");
> >         task.mutable_slave_id()->MergeFrom(offers.get()[0].slave_id());
> >         task.mutable_resources()->MergeFrom(offers.get()[0].resources());
> >         task.mutable_executor()->MergeFrom(DEFAULT_EXECUTOR_INFO);
> >     
> >         EXPECT_CALL(exec, registered(_, _, _, _))
> >           .Times(1);
> >     
> >         EXPECT_CALL(exec, launchTask(_, _))
> >           .WillOnce(SendStatusUpdateFromTask(TASK_RUNNING));
> >     
> >         Future<TaskStatus> runningStatus;
> >         EXPECT_CALL(sched, statusUpdate(&driver, _))
> >           .WillOnce(FutureArg<1>(&runningStatus));
> >     
> >         driver.launchTasks(offers.get()[0].id(), {task});
> >     
> >         AWAIT_READY(runningStatus);
> >         EXPECT_EQ(TASK_RUNNING, runningStatus.get().state());
> >         EXPECT_EQ(task.task_id(), runningStatus.get().task_id());
> >       }
> >     
> >       slaveDetector.appoint(None());
> >     
> >       master->reset();
> >       master = StartMaster(masterFlags);
> >       ASSERT_SOME(master);
> >     
> >       Future<SlaveReregisteredMessage> slaveReregisteredMessage =
> >         FUTURE_PROTOBUF(SlaveReregisteredMessage(), _, _);
> >     
> >       slaveDetector.appoint(master.get()->pid);
> >     
> >       AWAIT_READY(slaveReregisteredMessage);
> >     
> >       frameworkInfo.mutable_id()->CopyFrom(frameworkId.get());
> >       frameworkInfo.add_roles("role2");
> >       frameworkInfo.clear_role();
> >       frameworkInfo.add_capabilities()->set_type(
> >           FrameworkInfo::Capability::MULTI_ROLE);
> >     
> >       MockScheduler sched;
> >       MesosSchedulerDriver driver(
> >           &sched, frameworkInfo, master.get()->pid, DEFAULT_CREDENTIAL);
> >     
> >       Future<Nothing> registered;
> >       EXPECT_CALL(sched, registered(&driver, _, _))
> >         .WillOnce(FutureSatisfy(&registered));
> >     
> >       driver.start();
> >     
> >       Clock::pause();
> >       Clock::settle();
> >       Clock::resume();
> >     
> >       AWAIT_READY(registered);
> >     
> >       EXPECT_CALL(exec, shutdown(_))
> >         .Times(AtMost(1));
> >     
> >       driver.stop();
> >       driver.join();
> >     ```
> 
> Benjamin Bannier wrote:
>     Thanks for spotting this, this is of course unacceptable. Also your test case was
helpful. I updated the patch to allow `Master::activateRecoveredFramework` to fail.
> 
> Jay Guo wrote:
>     Maybe you could add this test case as part of our test suite as well?

I added the test case to the follow-up patch, https://reviews.apache.org/r/55271.


- Benjamin


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55571/#review161827
-----------------------------------------------------------


On Jan. 18, 2017, 4:26 p.m., Benjamin Bannier wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55571/
> -----------------------------------------------------------
> 
> (Updated Jan. 18, 2017, 4:26 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Jay Guo, and Guangya Liu.
> 
> 
> Bugs: MESOS-6631
>     https://issues.apache.org/jira/browse/MESOS-6631
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This allows us to confirm that an updated FrameworkInfo is fully
> compatible with an already known one. By performing this check in
> updateFrameworkInfo when can be sure that both the new and old
> FrameworkInfo are available (e.g., after framework reregistration, or
> master or framework failover).
> 
> 
> Diffs
> -----
> 
>   src/master/master.hpp 8e8a9037af11cf95961b6498540a0fd486ed091b 
>   src/master/master.cpp 73159328ce3fd838e02eba0e6a30cf69efc319ba 
> 
> Diff: https://reviews.apache.org/r/55571/diff/
> 
> 
> Testing
> -------
> 
> Tested as part of https://reviews.apache.org/r/55271/.
> 
> 
> Thanks,
> 
> Benjamin Bannier
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message