mesos-user mailing list archives

From Philip Weaver <>
Subject Re: High latency when scheduling and executing many tiny tasks.
Date Fri, 17 Jul 2015 21:06:44 GMT
Awesome, I suspected that was the case, but hadn't discovered the
--allocation_interval flag, so I will use that.
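
A rough way to see why the flag matters, as a minimal sketch: it assumes the allocator hands out recovered resources only on a periodic tick, and that tasks finish at a uniformly random point between ticks. Under those assumptions the average wait for a re-offer is half the interval. The function name is hypothetical and nothing here calls Mesos.

```python
def expected_reoffer_wait_ms(allocation_interval_ms: float) -> float:
    """Mean wait until the next allocation tick.

    Assumption: resources are freed at a uniformly random point within
    the interval and are only re-offered on the next tick.
    """
    return allocation_interval_ms / 2.0


# 1000 ms is the default interval quoted below; 50 ms is the suggested value.
for interval_ms in (1000, 50):
    wait_ms = expected_reoffer_wait_ms(interval_ms)
    print(f"{interval_ms} ms interval: about {wait_ms:.0f} ms average wait")
```

With the 1 second default this model gives about 500 ms, which is close to the re-offer delay reported in the original message below; at 50 ms it would drop to about 25 ms.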

I installed from the mesosphere RPMs and didn't change any flags from
there. I will try to find some logs that provide some insight into the
execution times.

I am using a command task. I haven't looked into executors yet; I had a
hard time finding some examples in my language (Scala).

On Fri, Jul 17, 2015 at 2:00 PM, Benjamin Mahler <> wrote:

> One other thing, do you use an executor to run many tasks? Or are you
> using a command task?
> On Fri, Jul 17, 2015 at 1:54 PM, Benjamin Mahler <> wrote:
>> Currently, recovered resources are not immediately re-offered, as you
>> noticed, and the default allocation interval is 1 second. I'd recommend
>> lowering that (e.g. --allocation_interval=50ms), which should improve the
>> second bullet you listed. In your case, though, it would be better to
>> immediately re-offer recovered resources (feel free to file a ticket for
>> supporting that).
>> For the first bullet, mind providing some more information? E.g. master
>> flags, slave flags, scheduler logs, master logs, slave logs, executor logs?
>> We would need to trace through a task launch to see where the latency is
>> being introduced.
>> On Fri, Jul 17, 2015 at 12:26 PM, Philip Weaver <>
>> wrote:
>>> I'm trying to understand the behavior of mesos, and if what I am
>>> observing is typical or if I'm doing something wrong, and what options I
>>> have for improving the performance of how offers are made and how tasks are
>>> executed for my particular use case.
>>> I have written a Scheduler that has a queue of very small tasks (for
>>> testing, they are "echo hello world", but in production many of them won't
>>> be much more expensive than that). Each task is configured to use 1 cpu
>>> resource. When resourceOffers is called, I launch as many tasks as I can in
>>> the given offers; that is, one call to driver.launchTasks for each offer,
>>> with a list of tasks that has one task for each cpu in that offer.
>>> On a cluster of 3 nodes and 4 cores each (12 total cores), it takes 120s
>>> to execute 1000 tasks out of the queue. We are evaluating mesos because we
>>> want to use it to replace our current homegrown cluster controller, which
>>> can execute 1000 tasks in way less than 120s.
>>> I am seeing two things that concern me:
>>>    - The time between driver.launchTasks and receiving a callback to
>>>    statusUpdate when the task completes is typically 200-500ms, and sometimes
>>>    even as high as 1000-2000ms.
>>>    - The time between when a task completes and when I get an offer for
>>>    the newly freed resource is another 500ms or so.
>>> These latencies explain why I can only execute tasks at a rate of about
>>> 8/s.
>>> It looks like my offers always include all 4 cores on each machine,
>>> which would indicate that mesos doesn't like to send an offer as soon as a
>>> single resource is available, and prefers to delay and send an offer with
>>> more resources in it. Is this true?
>>> Thanks in advance for any advice you can offer!
>>> - Philip
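
The figures in the original message can be checked with a little arithmetic. The sketch below uses only the numbers reported there (1000 tasks, 120 s, 12 cores) and assumes each core runs one task at a time, so that the per-core cycle time is cores divided by throughput (Little's law). The function names and the 0.5 s target cycle are hypothetical, chosen only for illustration.

```python
TOTAL_CORES = 12     # 3 nodes x 4 cores, as reported
TASKS = 1000         # tasks drained from the queue
ELAPSED_S = 120.0    # reported wall-clock time


def throughput_per_s(tasks: int, elapsed_s: float) -> float:
    """Observed task completion rate."""
    return tasks / elapsed_s


def cycle_time_s(slots: int, rate_per_s: float) -> float:
    """Average time one slot spends per task (Little's law: L = rate * W)."""
    return slots / rate_per_s


def projected_rate_per_s(slots: int, cycle_s: float) -> float:
    """Rate the same slots would reach at a given per-task cycle time."""
    return slots / cycle_s


rate = throughput_per_s(TASKS, ELAPSED_S)
cycle = cycle_time_s(TOTAL_CORES, rate)
print(f"observed rate: {rate:.1f} tasks/s")
print(f"implied cycle per core: {cycle:.2f} s")

# Hypothetical target: what the rate would be if the cycle fell to 0.5 s.
print(f"rate at 0.5 s cycle: {projected_rate_per_s(TOTAL_CORES, 0.5):.0f} tasks/s")
```

The observed rate matches the "about 8/s" in the message, and the implied cycle of roughly 1.4 s per core is somewhat more than the sum of the two reported latencies. Part of the difference may come from a node's cores being re-offered together, but that is a guess rather than something the thread confirms.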
