mesos-issues mailing list archives

From "Anand Mazumdar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-3302) Scheduler API v1 improvements
Date Wed, 25 May 2016 16:03:13 GMT

    [ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300286#comment-15300286 ]

Anand Mazumdar commented on MESOS-3302:
---------------------------------------

Jay Guo, thanks for testing out the new API. Here are the answers to your queries:
* Restart leading master
** For a non-HA cluster, the behavior is expected. The scheduler library does not currently
follow a redirect but merely relies on the detector to let it know of a new master. So, the
behavior is expected, and it works correctly for an HA cluster, as you pointed out.
** We want to fix the behavior, i.e., ensure there is a delay upon (re-)connection: https://issues.apache.org/jira/browse/MESOS-5359
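
The (re-)connection delay mentioned above could be sketched roughly as follows. This is only an illustration in Python, not the scheduler library's code: the helper name `reconnect_delay`, the `max_backoff_secs` parameter, and the choice of a uniform random delay are all assumptions, and MESOS-5359 may settle on a different policy.

```python
import random


def reconnect_delay(max_backoff_secs, rng=random):
    """Return a random delay in [0, max_backoff_secs] seconds.

    Hypothetical helper: spreading reconnect attempts over a window keeps
    many schedulers from hitting a freshly restarted master at once.
    """
    if max_backoff_secs < 0:
        raise ValueError("max_backoff_secs must be non-negative")
    return rng.uniform(0.0, max_backoff_secs)
```

A caller would sleep for `reconnect_delay(...)` seconds before attempting to (re-)connect to the master.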

* Restart agent
** Currently, the long-lived framework does not support moving existing tasks across agents.
However, it would be good to test that the executor is correctly recovered upon agent restart
with checkpointing enabled. If checkpointing is disabled, the executor should kill itself.
** Also, restarting the agent with --http_command_executor enabled or disabled should still
successfully recover all the executors.

* Emulate network partitions
** I am assuming that when you say "the framework hangs", you just mean that it does not
have anything to do?
** "However there was once that agent keeps launching new tasks without framework being aware
of it during partition."
This is expected. If a framework is partitioned from the master after sending LAUNCH messages,
the agent would still go ahead and launch the tasks. The framework would receive the status updates
for the running tasks upon re-registering, since the agent keeps retrying the updates every
10 mins. We currently do not implement any reconciliation in the long-lived framework.
** Also, it would be good to test the other one-way partition, i.e. the framework is partitioned
away from the master.
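
The partition behavior described above can be illustrated with a toy model. This is a sketch in Python under simplifying assumptions, not Mesos code: the `Agent` class and its methods are hypothetical, the master is left out entirely, and delivery of an update is treated as its acknowledgement.

```python
class Agent:
    """Toy model of an agent that retries unacknowledged status updates."""

    def __init__(self, retry_interval_secs=600):
        # 600 s mirrors the "every 10 mins" retry mentioned above.
        self.retry_interval_secs = retry_interval_secs
        self.pending = {}  # task_id -> latest unacknowledged state

    def launch(self, task_id):
        # The task starts even if the framework is partitioned away.
        self.pending[task_id] = "TASK_RUNNING"

    def retry(self, framework_reachable):
        """Run one retry round and return the updates that were delivered."""
        if not framework_reachable:
            return {}  # nothing delivered; updates stay pending
        delivered = dict(self.pending)
        self.pending.clear()  # assumption: delivery counts as an ack
        return delivered
```

During the partition, `retry(framework_reachable=False)` delivers nothing and the update stays pending; once the framework re-registers, the next retry round hands it the `TASK_RUNNING` update for the task it did not know about.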

To reduce noise on this improvement JIRA, should we create a Google doc with the testing
details and link it to the JIRA? I would also add the details of my own testing to that doc
so that everything is consolidated in one place. If it's easier for you, I can create the doc myself and
you can then add the details to it. Let me know what works for you.

> Scheduler API v1 improvements
> -----------------------------
>
>                 Key: MESOS-3302
>                 URL: https://issues.apache.org/jira/browse/MESOS-3302
>             Project: Mesos
>          Issue Type: Epic
>            Reporter: Marco Massenzio
>              Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the {{HTTP API}}
> MVP epic (MESOS-2288) which was released initially with Mesos {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what we would
> regard as "Production-ready" state in preparation for the {{1.0.0}} release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
