mesos-issues mailing list archives

From "Huadong Liu (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MESOS-7882) Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves
Date Mon, 14 Aug 2017 19:56:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-7882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126305#comment-16126305
] 

Huadong Liu edited comment on MESOS-7882 at 8/14/17 7:55 PM:
-------------------------------------------------------------

I was able to reproduce the problem. The test setup has two Mesos agents:
{noformat}
af584a07-7b1c-4955-861e-63585af8bb5d-S0: 10.255.55.153
af584a07-7b1c-4955-861e-63585af8bb5d-S1: 10.255.52.14
{noformat}

The modified example framework holds received offers for 30 seconds and launches tasks only
on S0.
{noformat}
diff --git a/src/examples/python/test_framework.py b/src/examples/python/test_framework.py
     def resourceOffers(self, driver, offers):
+        time.sleep(30)
         for offer in offers:
+            if 'af584a07-7b1c-4955-861e-63585af8bb5d-S1' == offer.slave_id.value:
+                print("ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1")
+                continue
             tasks = []
{noformat} 

Start test-framework, then post a maintenance schedule for S1 from another terminal while
test-framework is sleeping.
{noformat}
~/mesos/build$ ./src/examples/python/test-framework 10.255.52.14:5050
I0814 11:48:21.296404  4182 sched.cpp:232] Version: 1.3.0
I0814 11:48:21.301652  4222 sched.cpp:336] New master detected at master@10.255.52.14:5050
I0814 11:48:21.302145  4222 sched.cpp:352] No credentials provided. Attempting to register without authentication
I0814 11:48:21.306299  4224 sched.cpp:759] Framework registered with af584a07-7b1c-4955-861e-63585af8bb5d-0014
Registered with framework ID af584a07-7b1c-4955-861e-63585af8bb5d-0014

---------------------
$ cat schedule.json
{
  "windows" : [
    {
      "machine_ids" : [
        { "ip" : "10.255.52.14" }
      ],
      "unavailability" : {
        "start" : { "nanoseconds" : 1502734375000000000 },
        "duration" : { "nanoseconds" : 3600000000000 }
      }
    }
  ]
}
$ curl http://10.255.52.14:5050/maintenance/schedule -H "Content-type: application/json" -X POST -d @schedule.json
----------------

Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 with cpus: 3.0 and mem: 2927.0
Launching task 0 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 1 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Launching task 2 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
ignore offers from af584a07-7b1c-4955-861e-63585af8bb5d-S1
Received offer af584a07-7b1c-4955-861e-63585af8bb5d-O156 with cpus: 3.0 and mem: 2927.0
Launching task 3 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
Launching task 4 using offer af584a07-7b1c-4955-861e-63585af8bb5d-O156
W0814 11:49:51.406801  4218 sched.cpp:1371] Attempting to accept an unknown offer af584a07-7b1c-4955-861e-63585af8bb5d-O153
Task 0 is in state TASK_LOST
{noformat}
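For anyone scripting this step, here is a minimal sketch that builds the same payload as schedule.json above. The helper name build_schedule and its parameters are made up for illustration and are not part of Mesos; the sketch only assumes the JSON shape shown in the repro.

```python
import json

NANOS_PER_SECOND = 1_000_000_000


def build_schedule(ips, start_seconds, duration_seconds):
    """Build a schedule with one maintenance window covering the given IPs.

    Hypothetical helper: it mirrors the structure of schedule.json from the
    repro and converts second-based inputs to the nanosecond fields.
    """
    return {
        "windows": [
            {
                "machine_ids": [{"ip": ip} for ip in ips],
                "unavailability": {
                    "start": {"nanoseconds": start_seconds * NANOS_PER_SECOND},
                    "duration": {"nanoseconds": duration_seconds * NANOS_PER_SECOND},
                },
            }
        ]
    }


if __name__ == "__main__":
    # Same values as the schedule posted for S1 in the repro.
    print(json.dumps(build_schedule(["10.255.52.14"], 1502734375, 3600), indent=2))
```

The printed JSON can be saved to a file and posted with the same curl command as above.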

The Mesos master log captured while this was happening:
{noformat}
I0814 11:48:21.302987  1530 master.cpp:2596] Received SUBSCRIBE call for framework 'Test Framework (Python)' at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:21.303450  1530 master.cpp:2672] Subscribing framework Test Framework (Python) with checkpointing enabled and capabilities [  ]
I0814 11:48:21.304566  1529 hierarchical.cpp:275] Added framework af584a07-7b1c-4955-861e-63585af8bb5d-0014
I0814 11:48:21.306139  1530 master.cpp:6517] Sending 2 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:25.076035  1533 http.cpp:391] HTTP POST for /master/maintenance/schedule from 10.255.55.153:37186 with User-Agent='curl/7.47.0'
I0814 11:48:25.077271  1533 registrar.cpp:461] Applied 1 operations in 272915ns; attempting to update the registry
I0814 11:48:25.078277  1533 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 39
I0814 11:48:25.079033  1533 replica.cpp:537] Replica received write request for position 39 from __req_res__(44)@10.255.52.14:5050
I0814 11:48:25.082299  1531 replica.cpp:691] Replica received learned notice for position 39 from @0.0.0.0:0
I0814 11:48:25.085546  1531 registrar.cpp:506] Successfully updated the registry in 8.176128ms
I0814 11:48:25.085726  1535 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 40
I0814 11:48:25.086496  1528 master.cpp:5645] Removing unavailability of agent af584a07-7b1c-4955-861e-63585af8bb5d-S1 at slave(1)@10.255.52.14:5051 (10.255.52.14)
I0814 11:48:25.086550  1530 replica.cpp:537] Replica received write request for position 40 from __req_res__(45)@10.255.52.14:5050
I0814 11:48:25.087936  1530 replica.cpp:691] Replica received learned notice for position 40 from @0.0.0.0:0
I0814 11:48:25.088673  1528 master.cpp:5645] Removing unavailability of agent af584a07-7b1c-4955-861e-63585af8bb5d-S0 at slave(1)@10.255.55.153:5051 (10.255.55.153)
I0814 11:48:25.089725  1528 master.cpp:6517] Sending 1 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
I0814 11:48:25.090461  1529 master.cpp:6517] Sending 1 offers to framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 (Test Framework (Python)) at scheduler-6d672749-4414-4266-adfc-2b7ff5694d5b@10.255.52.14:45893
W0814 11:49:51.408465  1534 master.cpp:3494] Ignoring accept of offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 since it is no longer valid
W0814 11:49:51.408888  1534 master.cpp:3505] ACCEPT call used invalid offers '[ af584a07-7b1c-4955-861e-63585af8bb5d-O153 ]': Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid
I0814 11:49:51.409276  1534 master.cpp:5772] Sending status update TASK_LOST for task 0 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
I0814 11:49:51.409920  1534 master.cpp:5772] Sending status update TASK_LOST for task 1 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
I0814 11:49:51.410332  1534 master.cpp:5772] Sending status update TASK_LOST for task 2 of framework af584a07-7b1c-4955-861e-63585af8bb5d-0014 'Task launched with invalid offers: Offer af584a07-7b1c-4955-861e-63585af8bb5d-O153 is no longer valid'
{noformat}
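To line up the schedule POST with the rejected ACCEPT in the log above, something like the following sketch may help. It assumes the glog-style prefix visible in these lines (severity letter, MMDD, time with microseconds, thread ID, file:line) and that each entry sits on a single line; the regular expression and function names are my own and are not part of Mesos.

```python
import re
from datetime import datetime

# Matches prefixes such as "I0814 11:48:25.076035  1533 http.cpp:391] ...".
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<source>[\w.]+):(?P<line>\d+)\] (?P<message>.*)$"
)

# Two entries copied from the master log above.
SCHEDULE_POST = (
    "I0814 11:48:25.076035  1533 http.cpp:391] HTTP POST for "
    "/master/maintenance/schedule from 10.255.55.153:37186 with "
    "User-Agent='curl/7.47.0'"
)
REJECTED_ACCEPT = (
    "W0814 11:49:51.408465  1534 master.cpp:3494] Ignoring accept of offer "
    "af584a07-7b1c-4955-861e-63585af8bb5d-O153 since it is no longer valid"
)


def parse_glog_line(line):
    """Return the fields of one log entry, or None if the prefix is absent."""
    match = GLOG_LINE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = datetime.strptime(fields["time"], "%H:%M:%S.%f")
    return fields


def seconds_between(first_line, second_line):
    """Elapsed seconds between two entries logged on the same day."""
    first = parse_glog_line(first_line)
    second = parse_glog_line(second_line)
    return (second["time"] - first["time"]).total_seconds()


if __name__ == "__main__":
    print(seconds_between(SCHEDULE_POST, REJECTED_ACCEPT))
```

For the two entries above, this gives about 86 seconds between the schedule POST and the master rejecting the accept of the S0 offer O153.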


> Mesos master rescinds all the in-flight offers from all the registered agents when a new maintenance schedule is posted for a subset of slaves
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-7882
>                 URL: https://issues.apache.org/jira/browse/MESOS-7882
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.3.0
>         Environment: Ubuntu 14.04 (trusty)
> Mesos master branch.
> SHA: a31dd52ab71d2a529b55cd9111ec54acf7550ded
>            Reporter: Sagar Sadashiv Patwardhan
>            Priority: Minor
>
> We are running Mesos 1.1.0 in production. We use a custom autoscaler to scale our Mesos cluster up and down. While scaling down the cluster, the autoscaler makes a POST request to the Mesos master's /maintenance/schedule endpoint with a set of slaves to move to maintenance mode. This forces the Mesos master to rescind all the in-flight offers from *all the slaves* in the cluster. If our scheduler accepts one of these offers, we get a TASK_LOST status update back for that task. We also see log lines like these (https://gist.github.com/sagar8192/8858e7cb59a23e8e1762a27571824118) in the Mesos master logs.
> After reading the code (ref: https://github.com/apache/mesos/blob/master/src/master/master.cpp#L6772), it appears that offers are rescinded for all the slaves. I am not sure what the expected behavior is here, but it would make more sense if only resources from slaves marked for maintenance were reclaimed.
> *Experiment:*
> To verify that this is actually happening, I checked out the master branch (sha: a31dd52ab71d2a529b55cd9111ec54acf7550ded) and added some log lines (https://gist.github.com/sagar8192/42ca055720549c5ff3067b1e6c7c68b3). I built the binary and started a Mesos master and 2 agent processes. I used a basic Python framework that launches Docker containers on these slaves. I verified that there was no existing schedule for any slave using `curl 10.40.19.239:5050/maintenance/status`. I posted a maintenance schedule for one of the slaves (https://gist.github.com/sagar8192/fb65170240dd32a53f27e6985c549df0) after starting the Mesos framework.
> *Logs:*
> mesos-master: https://gist.github.com/sagar8192/91888419fdf8284e33ebd58351131203
> mesos-slave1: https://gist.github.com/sagar8192/3a83364b1f5ffc63902a80c728647f31
> mesos-slave2: https://gist.github.com/sagar8192/1b341ef2271dde11d276974a27109426
> Mesos framework: https://gist.github.com/sagar8192/bcd4b37dba03bde0a942b5b972004e8a
> I think Mesos should rescind offers and inverse offers only for those slaves that are marked for maintenance (draining mode).
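
As a toy illustration of the behavior proposed in the description, the selection could look like the sketch below. This is not the Mesos master code (which is C++); the function name, the offer IDs, and the mapping from offer ID to agent IP are hypothetical and exist only to show the filtering.

```python
def offers_to_rescind(outstanding_offers, scheduled_ips):
    """Return IDs of offers whose agent IP is in the maintenance schedule.

    Hypothetical model: outstanding_offers maps an offer ID to the IP of
    the agent the offer came from.
    """
    scheduled = set(scheduled_ips)
    return sorted(
        offer_id
        for offer_id, agent_ip in outstanding_offers.items()
        if agent_ip in scheduled
    )


if __name__ == "__main__":
    # Agent IPs are from the repro; the offer IDs are placeholders.
    outstanding = {
        "offer-on-S0": "10.255.55.153",
        "offer-on-S1": "10.255.52.14",
    }
    print(offers_to_rescind(outstanding, ["10.255.52.14"]))
```

Under this model, a schedule that names only 10.255.52.14 (S1) leaves the offer from S0 outstanding, so an accept against it would not be rejected.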



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
