mesos-user mailing list archives

From Itamar Ostricher <>
Subject Re: Recommended resources for master / scheduler machines
Date Sat, 10 Jan 2015 13:56:25 GMT
Interesting. I knew I needed to look into ZooKeeper more than I did :-)

I don't know what "distributed mode" is in ZooKeeper. I can tell you we use
a single host for the master, and configure all machines with
"zk://master-host-name:2181/mesos" in /etc/mesos/zk before the mesos
services are started.
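
For anyone checking the same thing on their own cluster: a minimal sketch, assuming the usual zk://host1:port1,host2:port2/path form for the /etc/mesos/zk value, that counts how many ZooKeeper hosts the URL lists. One host suggests a standalone ZooKeeper rather than a replicated ensemble. The function name parse_mesos_zk_url is made up for illustration and is not part of Mesos.

```python
def parse_mesos_zk_url(url):
    """Split a zk:// URL into (list of host:port strings, znode path)."""
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with zk://")
    rest = url[len(prefix):]
    # Everything up to the first "/" is the host list; the rest is the path.
    hosts_part, _, path = rest.partition("/")
    # Drop optional credentials of the form user:pass@ (assumed format).
    if "@" in hosts_part:
        hosts_part = hosts_part.split("@", 1)[1]
    hosts = [h for h in hosts_part.split(",") if h]
    return hosts, "/" + path


hosts, path = parse_mesos_zk_url("zk://master-host-name:2181/mesos")
mode = "single host" if len(hosts) == 1 else "%d hosts" % len(hosts)
print(mode, path)
```

This only inspects the URL; it says nothing about how the ZooKeeper servers themselves are configured.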

We don't assign a dedicated device to ZooKeeper, so maybe it bites us...
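
A rough way to check this, as a sketch only: compare the device IDs of the ZooKeeper transaction log directory and of some other busy directory. The helper name share_device and the paths in the comment are assumptions for illustration, not our actual layout. Note the limits: a match means the two paths are on the same filesystem, but a mismatch proves little, since two partitions of one physical disk also report different device IDs, and the docs say a dedicated partition is not enough.

```python
import os


def share_device(path_a, path_b):
    """Return True if both paths live on the same device (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Hypothetical usage; substitute the real ZooKeeper log directory and
# whatever else writes heavily on the host:
#   share_device("/var/lib/zookeeper", "/var/log/mesos")
print(share_device(".", "."))
```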

On Thu, Jan 8, 2015 at 9:33 PM, Tomas Barton <> wrote:

> Is ZooKeeper running in distributed mode?
> ZooKeeper periodically writes all data to disk (transaction log), so
> the bottleneck could be ZooKeeper rather than
> not enough CPUs. ZooKeeper limits each key to 1MB; typically 512MB of memory
> should be enough for ZooKeeper (though even 4GB
> might not be enough, depending on your use case).
> From the ZooKeeper docs:
> ZooKeeper's transaction log must be on a dedicated device. (A dedicated
> partition is not enough.) ZooKeeper writes the log sequentially, without
> seeking. Sharing your log device with other processes can cause seeks and
> contention, which in turn can cause multi-second delays.
> In particular, you should not create a situation in which ZooKeeper swaps
> to disk. The disk is death to ZooKeeper. Everything is ordered, so if
> processing one request swaps the disk, all other queued requests will
> probably do the same. DON'T SWAP.
> On 8 January 2015 at 16:47, Itamar Ostricher <> wrote:
>> Thanks Tomas.
>> We're still quite far from the 10k-20k machines limit :-)
>> Currently, our framework scheduler generates many (millions of) mostly
>> small tasks (some run ~100ms, some a few seconds).
>> I understand that the network is the main bottleneck, but we sometimes
>> experience lost tasks, and sometimes I see master logs indicating that the
>> master is unable to talk to the ZooKeeper service (which is on the same
>> host), and I was wondering if it's related to the CPU/RAM of the master machine.
>> Is 1 CPU enough? 2? 4?
>> 1GiB RAM? 4? 8?
>> On Thu, Jan 8, 2015 at 5:00 PM, Tomas Barton <>
>> wrote:
>>> Hi Itamar,
>>> there's definitely a limit on the number of machines a Mesos master can
>>> handle. This limit is between 10 000 and 20 000 (that's the number
>>> reported by Twitter). The bottleneck is caused by the event loop which
>>> handles communication at the master.
>>> With hundreds of machines you should be fine. Only if your
>>> framework scheduler demands
>>> too many resources for computing allocations might you encounter some
>>> problems.
>>>> How does the strength of the master & scheduler machines affect the
>>>> overall cluster performance?
>>> I would say that the network is usually the main bottleneck. Adding
>>> extra RAM won't improve mesos-master
>>> performance. Of course, if there's high CPU load on the master you might
>>> observe degraded performance. This also
>>> depends on the granularity of your tasks: whether you have a few long-running
>>> tasks or many short tasks (which run for
>>> just hundreds of ms).
>>> Tomas
>>> On 6 January 2015 at 10:12, Itamar Ostricher <> wrote:
>>>> Are there recommendations regarding master / scheduler machine
>>>> resources as a function of cluster size?
>>>> Say I have a cluster with hundreds of slave machines and thousands of
>>>> CPUs, with a single framework that will schedule millions of tasks.
>>>> How does the strength of the master & scheduler machines affect the
>>>> overall cluster performance?
>>>> Thanks,
>>>> - Itamar.
