mesos-user mailing list archives

From Vinod Kone <vinodk...@gmail.com>
Subject Re: Framework stops to receive the heartbeats and events and gets removed from master
Date Mon, 23 Jan 2017 18:00:52 GMT
No problem. Glad you figured it out.

@vinodkone

> On Jan 23, 2017, at 8:38 AM, Vova Shelgunov <vvshvv@gmail.com> wrote:
> 
> Yes, it works. Sorry for the trouble; the first time I looked at the logs I did not notice that failover_timeout is zero.
> 
> 2017-01-23 19:27 GMT+03:00 Vova Shelgunov <vvshvv@gmail.com>:
>> Logs from mesos master:
>> 
>> I0123 15:53:44.523613     7 http.cpp:391] HTTP POST for /master/api/v1/scheduler from 172.18.0.1:58864 with User-Agent='AHC/2.0'
>> I0123 15:53:44.524159     7 master.cpp:4827] Processing ACKNOWLEDGE call ac9a6e5e-67b3-490a-930f-0024eab734b4 for task 10336 of framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0
>> I0123 15:53:44.524849     7 master.cpp:7744] Removing task 10336 with resources cpus(*):0.1; mem(*):32 of framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 on agent 16c100c1-13fe-47b8-a2a0-aed9bafbbf8c-S0 at slave(1)@172.18.0.3:5051 (mesos-slave)
>> I0123 15:53:44.529033     7 master.cpp:1297] Framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) disconnected
>> I0123 15:53:44.529636     7 master.cpp:2902] Disconnecting framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>> I0123 15:53:44.529974     7 master.cpp:2926] Deactivating framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>> I0123 15:53:44.530299     7 master.cpp:1310] Giving framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework) 0ns to failover
>> I0123 15:53:44.530594     7 hierarchical.cpp:386] Deactivated framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005
>> I0123 15:53:44.531962     7 master.cpp:6369] Framework failover timeout, removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>> I0123 15:53:44.534992     7 master.cpp:7103] Removing framework 3edce0a6-2a9e-448f-a5c2-666e2c2c3086-0005 (Test HTTP Framework)
>> 
>> It seems failover timeout is set to zero for the framework.
>> 
>> It may be a coding error on my side that shows up when the framework loses its connection to the master multiple times (I see that I do not pass the failover_timeout value during reconnection).
>> I will check whether fixing that solves my issue.
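
As an illustration of the fix discussed here, the following is a minimal sketch of building the SUBSCRIBE call body so that failover_timeout is sent on every subscription, including reconnects. The helper name build_subscribe_call and its parameters are hypothetical; the JSON field names follow the Scheduler HTTP API documentation linked in this thread, and sending the body to the master is left out.

```python
import json


def build_subscribe_call(user, name, failover_timeout_secs, framework_id=None):
    """Build the JSON body of a SUBSCRIBE call (hypothetical helper).

    failover_timeout is always included, so a re-subscription cannot
    silently fall back to the default of zero seconds.
    """
    framework_info = {
        "user": user,
        "name": name,
        # Seconds the master waits for the framework to reconnect
        # before removing it.
        "failover_timeout": float(failover_timeout_secs),
    }
    call = {"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}}
    if framework_id is not None:
        # On reconnection, the previously assigned ID goes both at the
        # top level and inside framework_info.
        framework_info["id"] = {"value": framework_id}
        call["framework_id"] = {"value": framework_id}
    return call


if __name__ == "__main__":
    # Example values only; the ID below is a placeholder.
    body = build_subscribe_call(
        "root", "Test HTTP Framework", 3600, framework_id="example-framework-id"
    )
    print(json.dumps(body, indent=2))
```

Using one function for both the initial subscription and every reconnection keeps the two code paths from drifting apart, which appears to be what went wrong in this case.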
>> 
>> Thanks
>> 
>> 2017-01-23 19:05 GMT+03:00 Vova Shelgunov <vvshvv@gmail.com>:
>>> Hi,
>>> 
>>> I have run into a very strange situation with my framework, which talks to the Mesos master via the Scheduler HTTP API:
>>> 
>>> Sometimes my framework stops receiving heartbeats and task updates from the master.
>>> I read the Network partitions section of the Mesos documentation (http://mesos.apache.org/documentation/latest/scheduler-http-api/), and I see that if a framework does not receive heartbeats for some time, it should reconnect to the master.
>>> 
>>> I have written a heartbeat monitor that reconnects if no heartbeats have arrived in the last n seconds, but after reconnecting I always receive an ERROR from the Mesos master saying that my framework has been removed.
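
A heartbeat monitor of the kind described above could be sketched roughly as follows. The class name HeartbeatMonitor, its parameters, and the factor of 5 missed intervals are assumptions made for illustration, not the poster's code; the interval itself would normally come from heartbeat_interval_seconds in the SUBSCRIBED event.

```python
import time


class HeartbeatMonitor:
    """Tracks the time since the last event from the master (hypothetical sketch)."""

    def __init__(self, interval_secs, missed_allowed=5, clock=time.monotonic):
        # Allow several missed heartbeats before treating the
        # connection as partitioned; the factor is an assumption.
        self._timeout = interval_secs * missed_allowed
        self._clock = clock
        self._last_event = clock()

    def record_event(self):
        # Call this for every event received, heartbeats included.
        self._last_event = self._clock()

    def should_reconnect(self):
        # True once the silence exceeds the allowed window.
        return self._clock() - self._last_event > self._timeout
```

The monitor only decides when to reconnect. The re-subscription itself still has to carry the framework ID and a non-zero failover_timeout, otherwise the master removes the framework as soon as the old connection drops.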
>>> 
>>> Why is it happening?
>>> 
>>> Regards,
>>> Uladzimir
>> 
> 
