mesos-issues mailing list archives

From "Greg Mann (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection
Date Sat, 18 Jun 2016 00:39:05 GMT

     [ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-5635:
-----------------------------
    Description: 
This issue was observed recently on an internal test cluster. Due to a bug in the agent code
(MESOS-5629), regular segfaults were occurring on an agent. After one such failure, the agent
recovered and about a minute later the following was observed in the master logs:
{code}
I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3
at slave(1)@10.10.0.179:5051 (10.10.0.179)
{code}
However, we see nothing about registration in the agent logs at this time. Subsequently, in
the master logs, we see the agent continuing to reregister every couple of seconds:
{code}
I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3
at slave(1)@10.10.0.179:5051 (10.10.0.179)
{code}
After about four minutes of this, we see:
{code}
I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3
(10.10.0.179): health check timed out
{code}
And after this point, we see repeated reregistration attempts from that agent in the master
logs:
{code}
W0617 22:29:09.514423  2010 master.cpp:4773] Agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3
at slave(1)@10.10.0.179:5051 (10.10.0.179) attempted to re-register after removal;
{code}

During all of this, however, the agent logs indicate nothing about registration. All we see
are requests coming in to {{/state}}:
{code}
Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 http.cpp:192]
HTTP GET for /slave(1)/state from 10.10.0.181:38792 with User-Agent='Mozilla/5.0 (Macintosh;
Intel Mac OS X 10.10
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 http.cpp:192]
HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 http.cpp:192]
HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 http.cpp:192]
HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326   875 http.cpp:192]
HTTP GET for /slave(1)/state from 10.10.0.181:38803 with User-Agent='Mozilla/5.0 (Macintosh;
Intel Mac OS X 10.10
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465   873 http.cpp:192]
HTTP GET for /slave(1)/state from 10.10.0.179:41009
{code}

The lack of logging on the agent side and the health check timeout suggest a one-way disconnection
such that the master cannot send messages to the agent, but the agent can send messages to
the master. This behavior has been observed several times on this test cluster in the past
couple of days.
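
The timing described above can be checked directly against the master log excerpts. The following is a minimal sketch, not part of Mesos: the regular expression is an assumption based only on the glog-style prefix visible in those excerpts, the year 2016 is assumed from the date of this message (the log lines carry no year), and {{parse_glog}} is a hypothetical helper name. The wrapped log lines are rejoined into single strings.

```python
import re
from datetime import datetime

# Assumed layout, inferred from the master excerpts:
# <level><MMDD> <HH:MM:SS.micros> <thread> <file>:<line>] <message>
GLOG = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse_glog(line, year=2016):
    """Return (timestamp, level, message), or None if the line does not match.

    The year is an assumption: the log prefix only contains month and day.
    """
    m = GLOG.match(line)
    if m is None:
        return None
    stamp = f"{year}{m['month']}{m['day']} {m['time']}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f"), m["level"], m["msg"]


AGENT = "6d4248cd-2832-4152-b5d0-defbf36f6759-S3"
ADDR = "slave(1)@10.10.0.179:5051 (10.10.0.179)"

# Master log lines copied from the excerpts in this report.
master_lines = [
    f"I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent {AGENT} at {ADDR}",
    f"I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent {AGENT} at {ADDR}",
    f"I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent {AGENT} (10.10.0.179): health check timed out",
    f"W0617 22:29:09.514423  2010 master.cpp:4773] Agent {AGENT} at {ADDR} attempted to re-register after removal;",
]

events = [e for e in map(parse_glog, master_lines) if e is not None]
reregs = [ts for ts, _, msg in events if msg.startswith("Re-registering agent")]
removed = next(ts for ts, _, msg in events if msg.startswith("Removed agent"))

gap = reregs[1] - reregs[0]      # interval between the first two re-registrations
elapsed = removed - reregs[0]    # first re-registration until removal

print(f"gap between first two re-registrations: {gap.total_seconds():.3f}s")
print(f"first re-registration to removal: {elapsed.total_seconds():.1f}s")
```

Run against the full master log rather than these four lines, the same approach would show whether the re-registration interval stays steady until the health check timeout fires.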

  was:
This issue was observed recently on an internal test cluster. Due to a bug in the agent code
(MESOS-5629), regular segfaults were occurring on an agent. While the agent was recovering
from one of these failures, it segfaulted again. After that, we noticed that the agent began
recovery but never printed {{Finished recovery}}, and its logs showed no indication
of reregistering with the master. In the master's logs, however, the following line
was observed repeatedly, at intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4
at slave(1)@10.10.0.87:5051 (10.10.0.87) attempted to re-register after removal; shutting
it down
{code}
These re-registration attempts had no corresponding lines in the agent log.

Subsequently deleting the contents of the agent's {{work_dir}} and restarting it led to a
successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at slave(1)@10.10.0.87:5051
(10.10.0.87) with id 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}


> Agent repeatedly reregisters, possible one-way disconnection
> ------------------------------------------------------------
>
>                 Key: MESOS-5635
>                 URL: https://issues.apache.org/jira/browse/MESOS-5635
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Greg Mann
>              Labels: agent, mesosphere
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
