mesos-issues mailing list archives

From "Jay Guo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-3302) Scheduler API v1 improvements
Date Wed, 25 May 2016 14:58:12 GMT

    [ https://issues.apache.org/jira/browse/MESOS-3302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15300176#comment-15300176 ]

Jay Guo commented on MESOS-3302:
--------------------------------

[~vinodkone]

We are now manually testing the HTTP APIs; here are some observations:

*Cluster setup:*
* Bring up 3 masters, 3 agents, and 3 ZooKeepers
* Start the agents with the {{--use_http_command_executor}} flag (so that tasks run under the HTTP command executor)
* Start the long-lived framework (which uses the HTTP scheduler API; see the {{SUBSCRIBE}} sketch below)
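
For reference, this is roughly the v1 {{SUBSCRIBE}} request that an HTTP-API framework makes against {{/api/v1/scheduler}}. A minimal sketch in Python with {{requests}}; the master address and framework name are placeholders, not our actual setup:
{code:python}
import requests

# Assumption: the address of one of the three masters in our cluster; any of
# them will do for a first smoke test.
MASTER = "http://10.0.0.1:5050"

subscribe = {
    "type": "SUBSCRIBE",
    "subscribe": {
        "framework_info": {
            "user": "root",
            # Hypothetical framework name, not the one long-lived-framework uses.
            "name": "http-api-smoke-test",
        }
    },
}

# SUBSCRIBE keeps the connection open; the master streams events back as
# RecordIO-framed chunks, so read the response as a stream.
resp = requests.post(
    MASTER + "/api/v1/scheduler",
    json=subscribe,
    headers={"Accept": "application/json"},
    stream=True,
)

print(resp.status_code)
print(resp.headers.get("Mesos-Stream-Id"))
{code}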

*Test cases:*
* Restart leading master
_The framework is started with {{--master=<master-ip>}}, so it always talks to that fixed master, whether it is the leader or a follower._
*Expected:* the non-leading master replies with {{307 Temporary Redirect}}, the scheduler library follows the redirect to the actual leading master, and all of this stays transparent to the framework (see the redirect sketch below).
*Actual:* the redirect is reported back to the framework instead.
Is this the intended behaviour? When the framework is started with {{--master=zk://...}} instead, master detection works correctly and the framework resumes once a new leading master is elected, although the detection retries happen continuously without any pause. Should we consider introducing a retry interval?
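
To illustrate the redirect handling we expected to stay inside the scheduler library, here is a minimal sketch (Python with {{requests}}; the addresses are placeholders and the exact form of the {{Location}} header is an assumption):
{code:python}
import requests

MASTER = "http://10.0.0.1:5050"  # assumption: a non-leading master
subscribe = {
    "type": "SUBSCRIBE",
    "subscribe": {"framework_info": {"user": "root", "name": "http-api-smoke-test"}},
}

# Disable automatic redirect handling so the 307 from a non-leading master
# stays visible instead of being followed silently by requests.
resp = requests.post(MASTER + "/api/v1/scheduler", json=subscribe,
                     stream=True, allow_redirects=False)

if resp.status_code == 307:
    leader_url = resp.headers["Location"]
    # Assumption about the exact form of Location: it may be scheme-relative
    # (e.g. "//10.0.0.2:5050/api/v1/scheduler"), so add a scheme before retrying.
    if leader_url.startswith("//"):
        leader_url = "http:" + leader_url
    resp = requests.post(leader_url, json=subscribe, stream=True)

print(resp.status_code)  # 200 once we are talking to the leading master
{code}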

* Restart agent
*Expected:* if the agent stays down longer than the timeout, it is removed and its workload is migrated to other agents; if it comes back within the timeout, it resumes its tasks.
*Actual:* the framework keeps waiting for the agent to recover. It does resume working if the agent comes back in time; otherwise it waits indefinitely.
I suspect this is reasonable, since long-lived-framework declines the other offers and they are not offered to this framework again. I don't see an option to expire the decline filter though, or am I missing something? (see the {{DECLINE}} sketch after this item)
There is also a chance that the agent resumes its running tasks for a short while and is then _asked to terminate_ by the master. This is somewhat flaky and needs further investigation.
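
For reference, the {{Filters}} message accepted by the v1 {{DECLINE}} call does have a {{refuse_seconds}} field (it defaults to 5 seconds), which bounds how long the declined resources are filtered for this framework. A minimal sketch of a DECLINE that sets it explicitly (Python with {{requests}}; the master address, IDs, and stream id are placeholders):
{code:python}
import requests

# Placeholders: a real framework takes these from the SUBSCRIBED event and the
# Mesos-Stream-Id response header of its SUBSCRIBE call.
LEADER = "http://10.0.0.2:5050"
FRAMEWORK_ID = {"value": "<framework-id>"}
OFFER_ID = {"value": "<offer-id>"}
STREAM_ID = "<stream-id>"

decline = {
    "framework_id": FRAMEWORK_ID,
    "type": "DECLINE",
    "decline": {
        "offer_ids": [OFFER_ID],
        # refuse_seconds bounds how long the declined resources are filtered
        # out for this framework before they can be offered to it again.
        "filters": {"refuse_seconds": 60.0},
    },
}

resp = requests.post(
    LEADER + "/api/v1/scheduler",
    json=decline,
    headers={"Mesos-Stream-Id": STREAM_ID},
)

print(resp.status_code)  # 202 Accepted on success
{code}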

* Restart long lived framework
*Expected:* Recover
*Actual:* Recover

* Restart all masters at once
Same behaviour as _restarting leading master_

* Emulate network partitions (one-way and two-way) between the long-lived framework and the master
_The network partition is emulated at the TCP layer on the master host, using the iptables rule {{iptables -A INPUT -p tcp -s <framework-ip> --dport 5050 -j DROP}}._
** One-way: Master <--X-- Framework
In most cases this works as expected: the framework simply hangs, and the agent keeps resending messages because the acknowledgements are blocked. Once the block is lifted, everything resumes working. However, in one run the agent kept launching new tasks during the partition without the framework being aware of it. We need to find a way to reproduce this; I guess it has something to do with the state things were in when the network was cut.
** Two-way: WIP

* Restart the leading ZooKeeper
WIP

* Restart all ZooKeepers at once
WIP

> Scheduler API v1 improvements
> -----------------------------
>
>                 Key: MESOS-3302
>                 URL: https://issues.apache.org/jira/browse/MESOS-3302
>             Project: Mesos
>          Issue Type: Epic
>            Reporter: Marco Massenzio
>              Labels: mesosphere, twitter
>
> This Epic covers all the refinements that we may want to build on top of the {{HTTP API}} MVP epic (MESOS-2288), which was initially released with Mesos {{0.24.0}}.
> The tasks/stories here cover the necessary work to bring the API v1 to what we would regard as a "Production-ready" state, in preparation for the {{1.0.0}} release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
