mesos-issues mailing list archives

From "Anand Mazumdar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-4831) Master sometimes sends two inverse offers after the agent goes into maintenance.
Date Wed, 02 Mar 2016 00:30:18 GMT

     [ https://issues.apache.org/jira/browse/MESOS-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anand Mazumdar updated MESOS-4831:
----------------------------------
    Description: 
Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}}

https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull

{code}
I0229 11:08:57.027559   668 hierarchical.cpp:1437] No resources available to allocate!
I0229 11:08:57.027745   668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns
I0229 11:08:57.027757   675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.028586   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.029039   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
{code}
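
To check a CI log for this symptom, one can count the "Sending N inverse offers" lines per framework. The following is a minimal sketch, assuming the message format matches the excerpt above; {{countInverseOfferSends}} is a hypothetical helper for illustration and is not part of Mesos.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: returns, per framework ID, how many log lines
// report that inverse offers were sent. The marker text is taken from
// the master.cpp log lines quoted in this issue.
std::map<std::string, int> countInverseOfferSends(const std::string& log)
{
  const std::string marker = " inverse offers to framework ";
  std::map<std::string, int> counts;

  std::istringstream in(log);
  std::string line;
  while (std::getline(in, line)) {
    const std::size_t pos = line.find(marker);
    if (pos == std::string::npos) {
      continue; // Not an inverse offer line.
    }

    // The framework ID runs from the end of the marker up to the next
    // space, or to the end of the line if no space follows.
    const std::size_t begin = pos + marker.size();
    assert(begin <= line.size());

    const std::size_t end = line.find(' ', begin);
    const std::string id = (end == std::string::npos)
      ? line.substr(begin)
      : line.substr(begin, end - begin);

    ++counts[id];
  }

  return counts;
}
```

Applied to the excerpt above, this counts two such lines for framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000, whereas the workflow described in this issue expects one.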

The expected workflow for this test is roughly:

- The framework receives offers from the master.
- The maintenance schedule is updated (in the test, via a POST to the master's maintenance/schedule endpoint).
- The current offer is rescinded.
- A new offer with unavailability set is received from the master.
- After the agent goes into maintenance, an inverse offer is sent.

However, the logs show the master sending two inverse offers. The test still passes because
it only checks that the initial inverse offer is present. The problem can also be reproduced
with a modified version of the original test:

{code}
// Test ensures that an offer will have an `unavailability` set if the
// slave is scheduled to go down for maintenance.
TEST_F(MasterMaintenanceTest, PendingUnavailabilityTest)
{
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);

  MockExecutor exec(DEFAULT_EXECUTOR_ID);

  Try<PID<Slave>> slave = StartSlave(&exec);
  ASSERT_SOME(slave);

  auto scheduler = std::make_shared<MockV1HTTPScheduler>();

  EXPECT_CALL(*scheduler, heartbeat(_))
    .WillRepeatedly(Return()); // Ignore heartbeats.

  Future<Nothing> connected;
  EXPECT_CALL(*scheduler, connected(_))
    .WillOnce(FutureSatisfy(&connected))
    .WillRepeatedly(Return()); // Ignore future invocations.

  scheduler::TestV1Mesos mesos(master.get(), ContentType::PROTOBUF, scheduler);

  AWAIT_READY(connected);

  Future<Event::Subscribed> subscribed;
  EXPECT_CALL(*scheduler, subscribed(_, _))
    .WillOnce(FutureArg<1>(&subscribed));

  Future<Event::Offers> normalOffers;
  Future<Event::Offers> unavailabilityOffers;
  Future<Event::Offers> inverseOffers;
  EXPECT_CALL(*scheduler, offers(_, _))
    .WillOnce(FutureArg<1>(&normalOffers))
    .WillOnce(FutureArg<1>(&unavailabilityOffers))
    .WillOnce(FutureArg<1>(&inverseOffers));

  // The original offers should be rescinded when the unavailability is changed.
  Future<Nothing> offerRescinded;
  EXPECT_CALL(*scheduler, rescind(_, _))
    .WillOnce(FutureSatisfy(&offerRescinded));

  {
    Call call;
    call.set_type(Call::SUBSCRIBE);

    Call::Subscribe* subscribe = call.mutable_subscribe();
    subscribe->mutable_framework_info()->CopyFrom(DEFAULT_V1_FRAMEWORK_INFO);

    mesos.send(call);
  }

  AWAIT_READY(subscribed);

  v1::FrameworkID frameworkId(subscribed->framework_id());

  AWAIT_READY(normalOffers);
  EXPECT_NE(0, normalOffers->offers().size());

  // Regular offers shouldn't have unavailability.
  foreach (const v1::Offer& offer, normalOffers->offers()) {
    EXPECT_FALSE(offer.has_unavailability());
  }

  // Schedule this slave for maintenance.
  MachineID machine;
  machine.set_hostname(maintenanceHostname);
  machine.set_ip(stringify(slave.get().address.ip));

  const Time start = Clock::now() + Seconds(60);
  const Duration duration = Seconds(120);
  const Unavailability unavailability = createUnavailability(start, duration);

  // Post a valid schedule with one machine.
  maintenance::Schedule schedule = createSchedule(
      {createWindow({machine}, unavailability)});

  // We have a few seconds between the first set of offers and the
  // next allocation of offers. This should be enough time to perform
  // a maintenance schedule update. This update will also trigger the
  // rescinding of offers from the scheduled slave.
  Future<Response> response = process::http::post(
      master.get(),
      "maintenance/schedule",
      headers,
      stringify(JSON::protobuf(schedule)));

  AWAIT_EXPECT_RESPONSE_STATUS_EQ(OK().status, response);

  // The original offers should be rescinded when the unavailability
  // is changed.
  AWAIT_READY(offerRescinded);

  AWAIT_READY(unavailabilityOffers);
  EXPECT_NE(0, unavailabilityOffers->offers().size());

  // Make sure the new offers have the unavailability set.
  foreach (const v1::Offer& offer, unavailabilityOffers->offers()) {
    EXPECT_TRUE(offer.has_unavailability());
    EXPECT_EQ(
        unavailability.start().nanoseconds(),
        offer.unavailability().start().nanoseconds());

    EXPECT_EQ(
        unavailability.duration().nanoseconds(),
        offer.unavailability().duration().nanoseconds());
  }

  // We also expect an inverse offer for the slave to go under
  // maintenance.
  AWAIT_READY(inverseOffers);
  EXPECT_NE(0, inverseOffers->inverse_offers().size());

  EXPECT_CALL(exec, shutdown(_))
    .Times(AtMost(1));

  EXPECT_CALL(*scheduler, disconnected(_))
    .Times(AtMost(1));

  Shutdown(); // Must shutdown before 'containerizer' gets deallocated.
}
{code}
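
The modified test presumably reproduces the problem through strictness: in gmock, three chained {{WillOnce}} actions with no {{WillRepeatedly}} give the {{offers}} expectation a cardinality of exactly three calls, so a second inverse offer would arrive as an unexpected fourth call. The same idea can be sketched without Mesos or gmock by comparing an observed event sequence against the expected workflow. The {{Event}} enum, {{expectedWorkflow}} and {{firstDeviation}} below are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event kinds, named after the workflow steps in this issue.
enum class Event { Offer, Rescind, OfferWithUnavailability, InverseOffer };

// The scheduler-visible events in the order the workflow expects them.
std::vector<Event> expectedWorkflow()
{
  return {Event::Offer,
          Event::Rescind,
          Event::OfferWithUnavailability,
          Event::InverseOffer};
}

// Returns the index of the first position at which `observed` deviates
// from `expected` (a wrong, extra, or missing event), or -1 if the two
// sequences match exactly.
long firstDeviation(
    const std::vector<Event>& expected,
    const std::vector<Event>& observed)
{
  for (std::size_t i = 0; i < observed.size(); ++i) {
    if (i >= expected.size() || observed[i] != expected[i]) {
      return static_cast<long>(i);
    }
  }

  if (observed.size() < expected.size()) {
    return static_cast<long>(observed.size()); // An event is missing.
  }

  return -1;
}
```

With the expected workflow followed by one additional {{InverseOffer}}, {{firstDeviation}} returns 4, the index of the extra event. A check that only waits for the first inverse offer, as the original test does, would not notice it.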

Separately, this test should be cleaned up so that it no longer expects multiple offers, i.e.
the {{numberOfOffers}} constant should be removed.

  was:
Showed up on ASF CI for {{MasterMaintenanceTest.PendingUnavailabilityTest}}

https://builds.apache.org/job/Mesos/1748/COMPILER=gcc,CONFIGURATION=--verbose,ENVIRONMENT=GLOG_v=1%20MESOS_VERBOSE=1,OS=ubuntu:14.04,label_exp=(docker%7C%7CHadoop)&&(!ubuntu-us1)/consoleFull

{code}
I0229 11:08:57.027559   668 hierarchical.cpp:1437] No resources available to allocate!
I0229 11:08:57.027745   668 hierarchical.cpp:1150] Performed allocation for slave fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-S0 in 272747ns
I0229 11:08:57.027757   675 master.cpp:5369] Sending 1 offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.028586   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
I0229 11:08:57.029039   675 master.cpp:5459] Sending 1 inverse offers to framework fd39ca89-d7fd-4df8-ad50-dbb493d1cd7b-0000 (default)
{code}

The ideal expected workflow for this test is something like:

- The framework receives offers from master.
- The framework updates its maintenance schedule.
- The current offer is rescinded.
- A new offer is received from the master with unavailability set.
- After the agent goes for maintenance, an inverse offer is sent.

For some reason, in the logs we see that the master is sending 2 inverse offers. The test
seems to pass as we just check for the initial inverse offer being present. 

Also, unrelated, we need to clean up this test to not expect multiple offers i.e. remove {{numberOfOffers}}
constant.


> Master sometimes sends two inverse offers after the agent goes into maintenance.
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-4831
>                 URL: https://issues.apache.org/jira/browse/MESOS-4831
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.27.0
>            Reporter: Anand Mazumdar
>              Labels: maintenance, mesosphere



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
