mesos-user mailing list archives

From Joseph Wu <jos...@mesosphere.io>
Subject Re: Master slow to process status updates after massive killing of tasks?
Date Fri, 17 Jun 2016 22:26:11 GMT
A couple questions about your test:

- By "killed off", were your agents killed permanently (i.e. powered off)
or temporarily (i.e. network partition).  And how long were your agents
killed/down during the test?
- How many of the 1000 accidentally killed tasks were running on your
killed-off agents vs your normal agents?
- What's the timeframe of the test?  Did everything happen in minutes?
Hours?
- Are you only running Singularity?  Any other frameworks in the picture?

Questions about your setup:

- Are you monitoring master metrics while this is happening?  The flood of
TASK_KILLED status updates may be filling up the master's event queue
(master/event_queue_messages); there's a sketch after these questions
showing one way to watch that gauge.
- Are you monitoring the master's CPU usage?  If maxed out, the bottleneck
is probably the master :(
- Where is your framework running with regards to the master?  Is network
bandwidth/latency limited?
- Can you check the time between your framework accepting an offer and the
master logging "Processing ACCEPT call for offers..."?

Questions about Singularity:

- Does Singularity handle status update acknowledgements explicitly?  Or
does it leave this up to the old scheduler driver (the default)?
- When does Singularity use the reconcileTasks call?  That is the source of
the "Performing explicit task state reconciliation..." master log line, and
it might be contributing to the slowness.  (There's a rough sketch of both
driver calls below.)
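For reference, the two driver calls in question look roughly like this with
the Java bindings. This is purely illustrative, not Singularity's actual
code, and the class/helper names are made up:

import java.util.ArrayList;
import java.util.Collection;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Illustrative helpers only, not Singularity's code.
public class DriverCalls {

  // Explicit acknowledgement: only relevant if the driver was created with
  // implicit acknowledgements disabled. Called from the scheduler's
  // statusUpdate() callback after the update has been recorded.
  static void ackUpdate(SchedulerDriver driver, TaskStatus status) {
    driver.acknowledgeStatusUpdate(status);
  }

  // Explicit reconciliation: the master replies with the latest known state
  // of each listed task via statusUpdate(), and this is what produces the
  // "Performing explicit task state reconciliation..." log line. An empty
  // collection requests implicit (full) reconciliation instead.
  static void reconcile(SchedulerDriver driver, Collection<String> taskIds) {
    Collection<TaskStatus> statuses = new ArrayList<TaskStatus>();
    for (String id : taskIds) {
      statuses.add(TaskStatus.newBuilder()
          .setTaskId(TaskID.newBuilder().setValue(id))
          // The proto requires a state; TASK_STAGING is just a placeholder
          // here, frameworks normally set the last state they saw.
          .setState(TaskState.TASK_STAGING)
          .build());
    }
    driver.reconcileTasks(statuses);
  }
}

If Singularity is calling reconcileTasks frequently with a large task list
while the event queue is already backed up, that would add to the master's
load.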


On Fri, Jun 17, 2016 at 2:27 PM, Thomas Petr <tpetr@hubspot.com> wrote:

> Hey folks,
>
> We got our Mesos cluster (0.28.1) into an interesting state during some
> chaos monkey testing. I killed off 5 of our 16 agents to simulate an AZ
> outage, and then accidentally killed off almost all running tasks (a little
> more than 1,000 of our ~1,300 tasks -- not intentional but an interesting
> test nonetheless!). During this time, we noticed:
>
> - The time between our framework accepting an offer and the master
> considering the task as launched spiked to ~2 minutes (which became doubly
> problematic due to our 2 minute offer timeout)
> - It would take up to 8 minutes for TASK_KILLED status updates from a
> slave to be acknowledged by the master.
> - The master logs contained tons of log lines mentioning "Performing
> explicit task state reconciliation..."
> - The killed agents took ~5 minutes to recover after I booted them back up.
> - The whole time, resources were offered to the framework at a normal rate.
>
> I understand that this is an exceptional situation, but does anyone have
> any insight into exactly what's going on behind the scenes? Sounds like all
> the status updates were backed up in a queue and the master was processing
> them one at a time. Is there something we could have done better in our
> framework <https://github.com/hubspot/singularity> to do this more
> gracefully? Is there any sort of monitoring of the master backlog that we
> can take advantage of?
>
> Happy to include master / slave / framework logs if necessary.
>
> Thanks,
> Tom
>
