mesos-user mailing list archives

From Martin Stiborský <martin.stibor...@gmail.com>
Subject Re: zookeeper quorum failing because of high network load
Date Tue, 28 Apr 2015 15:28:44 GMT
I finally tracked down the real problem, and it's not related to
mesos at all.
It was fleet on CoreOS stopping all containers on a node, because the node
was considered unresponsive from the CoreOS/etcd/fleet cluster point of
view.
The high CPU/network load caused the problem, and fleet decided to stop the
services on the node in order to run them on another node.
In retrospect it of course sounds like a pretty clear thing, and it's
true that I should have looked at the fleet log first, my bad.
The solution is a slight tuning of the etcd and fleet parameters, like they
did here for example:
https://github.com/deis/deis/pull/1689/files
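For anyone who hits the same thing: the idea is to give etcd and fleet more
slack before a busy node is declared dead. A rough sketch of the kind of
cloud-config change involved (the exact keys and values depend on your
etcd/fleet versions; the numbers below are only illustrative, not the ones
from that PR):

    #cloud-config
    coreos:
      etcd:
        # allow more time between peer heartbeats before an election is triggered
        peer-heartbeat-interval: 500    # ms, illustrative
        peer-election-timeout: 2500     # ms, illustrative
      fleet:
        # give the agent longer before fleet considers the machine gone
        agent_ttl: 120s                 # illustrative
        etcd_request_timeout: 3.0       # seconds, illustrative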

Thanks a lot guys for your effort, it helped!

On Tue, Apr 28, 2015 at 10:58 AM Ondrej Smola <ondrej.smola@gmail.com>
wrote:

> Hi Martin,
>
> do all 3 ZooKeepers go down with the same error logs/cause? There should be
> some info, as one node's failure should not cause ZK to fail (quorum is
> maintained), and the remaining nodes should at least show some info from the
> failure detector.
> The original logs you posted are from after stopping ZooKeeper - I saw these
> logs very frequently when I ran Apache Storm in local/devel mode and
> terminated it from the IDE - I think they are due to forcibly stopping ZK (from
> the timestamps there is a 30 second timeout) - but I never saw them in
> production/non-local mode. The problem should be described in the log lines
> before "systemd[1]: Stopping Zookeper server...". Could you please post the
> preceding lines?
>
> Is it specific only to the deployed application (db + app image) - are other
> applications running OK?
>
> 2015-04-28 10:24 GMT+02:00 Martin Stiborský <martin.stiborsky@gmail.com>:
>
>> Hi guys,
>> these machines are relatively beefy - Dell PowerEdge R710 with 2x QC
>> Xeon, 144GB RAM; CoreOS is deployed on bare metal.
>> - ZK is running on the same 3 nodes as the mesos cluster
>> - our application is not using ZK
>> - nothing else is running on the stack, only 1 mesos master, 3 mesos slaves
>> and marathon, all of this on top of CoreOS booted via iPXE from the network
>> - the ZK log is not on a dedicated disk; I can put it on an NFS share
>>
>> The pattern is always the same. We start the first container on the first
>> node, it's a database, then we run the second container with our
>> application on the second cluster node. The application loads data from
>> the database container on the first node, and then after about 6 minutes the
>> stack goes down.
>>
>> If we run both containers on the same node, it's fine. That's why I tend to
>> blame the network, but I can't find the problem.
>>
>> On Tue, Apr 28, 2015 at 7:33 AM Charles Baker <cnobleb@gmail.com> wrote:
>>
>>> Hi Martin. Are these VMs or bare metal? Is ZK running on the same 3
>>> nodes as the mesos cluster? Does your application also use ZooKeeper to
>>> manage its own state? Are there any other services running on the machines,
>>> and do Mesos and ZK have enough resources? And as Tomas asked: is your ZK
>>> log on a dedicated disk?
>>>
>>>
>>> On Mon, Apr 27, 2015 at 11:20 AM Martin Stiborský <
>>> martin.stiborsky@gmail.com> wrote:
>>>
>>>> Hi,
>>>> there are 3 ZooKeeper nodes.
>>>> We've started our containers, and this time I was watching the
>>>> ZooKeepers and their condition with the "stat" command.
>>>> It seems that ZooKeeper latency is not the issue; there were only about
>>>> 8 connections, with a max latency of 134ms.
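>>>> A quick way to poll that is ZooKeeper's four-letter commands over nc; a
>>>> minimal sketch (the hostname is just an example):
>>>>
>>>>   echo stat | nc zk-node-1 2181   # connections, latency, mode (leader/follower)
>>>>   echo mntr | nc zk-node-1 2181   # more detailed counters, on ZK 3.4+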
>>>>
>>>> I'm still not sure what the real cause is here…from the mesos-master log I
>>>> see normal behaviour and then suddenly:
>>>> Apr 27 18:02:37 systemd[1]: mesos-master@1.service: main process
>>>> exited, code=exited, status=137/n/a
>>>>
>>>> If we run our containers all on one mesos-slave node, it works, but
>>>> when they are distributed across three nodes, it fails.
>>>>
>>>>
>>>> On Mon, Apr 27, 2015 at 11:32 AM Tomas Barton <barton.tomas@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Martin,
>>>>>
>>>>> how many ZooKeepers do you have? Is your transaction log on a
>>>>> dedicated disk? How many clients are approximately connecting?
>>>>>
>>>>> have a look at
>>>>> http://zookeeper.apache.org/doc/r3.2.2/zookeeperAdmin.html#sc_bestPractices
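>>>>>
>>>>> In particular, the transaction log can be pointed at its own disk with the
>>>>> dataLogDir setting in zoo.cfg - a minimal sketch (paths are examples only):
>>>>>
>>>>>   # zoo.cfg
>>>>>   dataDir=/var/lib/zookeeper      # snapshots
>>>>>   dataLogDir=/mnt/zk-txlog        # transaction log on a dedicated disk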
>>>>>
>>>>> Tomas
>>>>>
>>>>> On 27 April 2015 at 10:58, Martin Stiborský <
>>>>> martin.stiborsky@gmail.com> wrote:
>>>>>
>>>>>> Hello guys,
>>>>>> we are running a mesos stack on CoreOS, with three zookeeper nodes.
>>>>>>
>>>>>> We can start docker containers with Marathon and all, that's fine,
>>>>>> but some of the docker containers generate high network load while
>>>>>> communicating between nodes/containers, and I think that's the reason why
>>>>>> the ZooKeeper is failing.
>>>>>> From the logs, I can see this error:
>>>>>>
>>>>>> Apr 27 05:06:15 epsp02.dc.vendavo.com systemd[1]: Stopping Zookeper server...
>>>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,705 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
>>>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: EndOfStreamException: Unable to read additional data from client sessionid 0x14cf73508730003, likely client has closed socket
>>>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>>>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>>>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: at java.lang.Thread.run(Thread.java:745)
>>>>>> Apr 27 05:06:45 epsp02.dc.vendavo.com docker[1155]: 2015-04-27 05:06:45,707 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.60.11.82:58082 which had sessionid 0x14cf73508730003
>>>>>>
>>>>>> And then all the ZK nodes go down…mesos fails as well and that's it.
>>>>>> The cluster eventually recovers, but the tasks that were running are gone,
>>>>>> not finished.
>>>>>>
>>>>>> I have to say I don't have proper monitoring in place yet (I'm working
>>>>>> on it right now), so I can't rely on real data to prove this assumption,
>>>>>> but it's my guess.
>>>>>> So if you can confirm that this makes sense, or share your experiences
>>>>>> with me, that would be pretty valuable for me right now.
>>>>>>
>>>>>> Thanks a lot!
>>>>>>