mesos-dev mailing list archives

From Chengwei Yang <chengwei.yang...@gmail.com>
Subject Re: When does scheduler driver send the LaunchTasksMessage
Date Wed, 04 Feb 2015 01:06:44 GMT
On Tue, Feb 03, 2015 at 07:28:37AM -0800, Adam Bordelon wrote:
> Just make sure you only send one LaunchTasksMessage per slave, although that
> message could contain multiple tasks launched on a collection of offers from
> the same slave.

Yes, generally we use the *deprecated* launchTasks with a single offer.
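
For illustration, here is a minimal sketch of the per-slave grouping described above: collect the offers by slave, then build one launch request per slave carrying all of that slave's offer IDs and tasks. The dict-based offers, the field names `slave_id` and `offer_id`, and the helper `plan_launches` are hypothetical stand-ins, not the Mesos API; in a real scheduler each planned entry would presumably correspond to one launchTasks call that takes a collection of offer IDs.

```python
from collections import defaultdict


def group_offers_by_slave(offers):
    """Map each slave ID to the list of offer IDs received from that slave."""
    grouped = defaultdict(list)
    for offer in offers:
        grouped[offer["slave_id"]].append(offer["offer_id"])
    return dict(grouped)


def plan_launches(offers, tasks_by_slave):
    """Return one (slave_id, offer_ids, tasks) entry per slave that has tasks.

    Slaves with no tasks assigned are skipped; their offers would be
    declined (or held) separately.
    """
    launches = []
    for slave_id, offer_ids in group_offers_by_slave(offers).items():
        tasks = tasks_by_slave.get(slave_id, [])
        if tasks:
            launches.append((slave_id, offer_ids, tasks))
    return launches


# Hypothetical data: two offers from slave "S0", one from slave "S1".
offers = [
    {"offer_id": "O0", "slave_id": "S0"},
    {"offer_id": "O1", "slave_id": "S1"},
    {"offer_id": "O2", "slave_id": "S0"},
]
tasks_by_slave = {"S0": ["task-a", "task-b"]}

launches = plan_launches(offers, tasks_by_slave)
```

With this data, only slave S0 gets a launch request, and it combines offers O0 and O2 in a single message.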

> You mention that launching 1000s in the same message causes Mesos to crash. Do
> you have a crash stack available for this?

See here.

https://issues.apache.org/jira/browse/MESOS-1804
https://issues.apache.org/jira/browse/MESOS-1795

> You shouldn't have to respond to all offers received before tasks get launched.

Thanks!

> Some frameworks "hoard" offers in case they want to launch something on them
> later, but launch other tasks in the meantime. Perhaps the delay has something
> to do with Chronos' cron-like scheduling feature?

I'll confirm this and keep you updated.

--
Thanks,
Chengwei

> 
> On Tue, Feb 3, 2015 at 5:46 AM, Chengwei Yang <chengwei.yang.cn@gmail.com>
> wrote:
> 
>     Hi List,
> 
>     We are running Chronos on Mesos 0.19.0 and found an interesting problem:
>     if we try to launch about 1k tasks in a single resourceOffers(), it may
>     crash, and no tasks are started by Mesos at all.
> 
>     So we ran the following test.
> 
>     We changed the Chronos resourceOffers() callback to:
> 
>     1. print a log message
>     2. decline the first offer in the batch of offers
>     3. sleep 30 seconds
>     4. decline all the offers received
> 
>     We also added a log statement in src/master/master.cpp that prints
>     whenever a LaunchTasksMessage is received; see the log below.
> 
>     -----------8<-----------------------
>     I0203 18:32:33.169342  7680 master.cpp:2939] Sending 3 offers to framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:32:39.523227  7670 http.cpp:452] HTTP request for '/master/state.json'
>     I0203 18:32:49.601284  7674 http.cpp:452] HTTP request for '/master/state.json'
>     I0203 18:32:59.677875  7677 http.cpp:452] HTTP request for '/master/state.json'
>     I0203 18:33:03.390188  7676 master.cpp:1754] Received launchTasks message
>     for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:33:03.390949  7676 master.cpp:1895] Processing reply for offers: [
>     20150203-183014-2487817994-5050-7668-0 ] on slave
>     20150203-183014-2487817994-5050-7668-2 at slave(1)@10.23.73.140:5051
>     (xulijian-mesos-online016-cqdx.qiyi.virtual) for framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:33:03.391469  7676 master.cpp:1754] Received launchTasks message
>     for offer [ 20150203-183014-2487817994-5050-7668-0 ] of framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:33:03.391791  7670 hierarchical_allocator_process.hpp:589]
>     Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave
>     20150203-183014-2487817994-5050-7668-2 for 5secs
>     W0203 18:33:03.392019  7676 master.cpp:1871] Failed to validate offer
>     20150203-183014-2487817994-5050-7668-0: Offer
>     20150203-183014-2487817994-5050-7668-0 is no longer valid
>     I0203 18:33:03.393173  7676 master.cpp:1754] Received launchTasks message
>     for offer [ 20150203-183014-2487817994-5050-7668-1 ] of framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:33:03.393601  7676 master.cpp:1895] Processing reply for offers: [
>     20150203-183014-2487817994-5050-7668-1 ] on slave
>     20150203-183014-2487817994-5050-7668-1 at slave(1)@10.23.73.141:5051
>     (xulijian-mesos-online017-cqdx.qiyi.virtual) for framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:33:03.394057  7676 master.cpp:1754] Received launchTasks message
>     for offer [ 20150203-183014-2487817994-5050-7668-2 ] of framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:33:03.394379  7679 hierarchical_allocator_process.hpp:589]
>     Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave
>     20150203-183014-2487817994-5050-7668-1 for 5secs
>     I0203 18:33:03.394664  7676 master.cpp:1895] Processing reply for offers: [
>     20150203-183014-2487817994-5050-7668-2 ] on slave
>     20150203-183014-2487817994-5050-7668-0 at slave(1)@10.23.73.148:5051
>     (xulijian-mesos-online015-cqdx.qiyi.virtual) for framework
>     20150203-174243-2487817994-5050-10996-0000
>     I0203 18:33:03.395504  7676 hierarchical_allocator_process.hpp:589]
>     Framework 20150203-174243-2487817994-5050-10996-0000 filtered slave
>     20150203-183014-2487817994-5050-7668-0 for 5secs
>     ---------------8<-------------------
> 
>     As we can see, mesos-master sent the offers to Chronos at 18:32:33, but
>     received all 4 decline messages (LaunchTasksMessage) at 18:33:03. We are
>     very curious why the first decline wasn't sent before the 30-second sleep.
> 
>     From the log, we see that offer 0 is no longer valid because we had
>     already sent a decline for it.
> 
>     Does that mean we (the framework scheduler) have to reply to all offers
>     received before we can launch any task?
> 
>     --
>     Thanks,
>     Chengwei
> 
> 
