mesos-user mailing list archives

From Martin Stiborský <>
Subject zookeeper quorum failing because of high network load
Date Mon, 27 Apr 2015 08:58:44 GMT
Hello guys,
we are running a Mesos stack on CoreOS with three ZooKeeper nodes.

We can start Docker containers with Marathon and all, that's fine, but
some of the Docker containers generate high network load while
communicating between nodes/containers, and I think that's the reason why
ZooKeeper is failing.
From the logs, I can see this error:

Apr 27 05:06:15 systemd[1]: Stopping Zookeper
Apr 27 05:06:45 docker[1155]: 2015-04-27 05:06:45,705
[myid:1] - WARN  [NIOServerCxn.Factory:] - caught end of stream
Apr 27 05:06:45 docker[1155]: EndOfStreamException:
Unable to read additional data from client sessionid 0x14cf73508730003,
likely client has closed socket
Apr 27 05:06:45 docker[1155]: at
Apr 27 05:06:45 docker[1155]: at
Apr 27 05:06:45 docker[1155]: at
Apr 27 05:06:45 docker[1155]: 2015-04-27 05:06:45,707
[myid:1] - INFO  [NIOServerCxn.Factory:] - Closed socket connection
for client / which had sessionid 0x14cf73508730003

And then all the ZK nodes go down, Mesos fails as well, and that's it. The
cluster does eventually recover, but the running tasks are gone without
having finished.

I have to say I don't have proper monitoring in place yet (I'm working on
it right now), so I have no real data to back this assumption; it's only
my guess.
So if you can confirm that this makes sense, or share your experiences
with me, that would be very valuable for me right now.

Thanks a lot!
