From issues-return-52491-archive-asf-public=cust-asf.ponee.io@mesos.apache.org Fri Jan 24 18:28:03 2020
Return-Path:
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 80C9D18064E for ; Fri, 24 Jan 2020 19:28:03 +0100 (CET)
Received: (qmail 93811 invoked by uid 500); 24 Jan 2020 18:28:02 -0000
Mailing-List: contact issues-help@mesos.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@mesos.apache.org
Delivered-To: mailing list issues@mesos.apache.org
Received: (qmail 93802 invoked by uid 99); 24 Jan 2020 18:28:02 -0000
Received: from mailrelay1-us-west.apache.org (HELO mailrelay1-us-west.apache.org) (209.188.14.139) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Jan 2020 18:28:02 +0000
Received: from jira-he-de.apache.org (static.172.67.40.188.clients.your-server.de [188.40.67.172]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id AB151E2E12 for ; Fri, 24 Jan 2020 18:28:01 +0000 (UTC)
Received: from jira-he-de.apache.org (localhost.localdomain [127.0.0.1]) by jira-he-de.apache.org (ASF Mail Server at jira-he-de.apache.org) with ESMTP id 53A54780544 for ; Fri, 24 Jan 2020 18:28:00 +0000 (UTC)
Date: Fri, 24 Jan 2020 18:28:00 +0000 (UTC)
From: "Dalton Matos Coelho Barreto (Jira)"
To: issues@mesos.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (MESOS-10068) Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/MESOS-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023162#comment-17023162 ]

Dalton Matos Coelho Barreto commented on MESOS-10068:
-----------------------------------------------------

Hello [~greggomann],

Thanks for taking the time to answer this ticket.

About the time needed to fix this bug, I understand. In fact, I would like to ask whether you (and [~bmahler] or any others) are willing to mentor a new developer in the Mesos project codebase. I studied the code some time ago (because of ticket MESOS-8517) but didn't manage to contribute any code at that time.

About the new ticket you created to fix what I reported here, do you think it's better to close this ticket and mention it on the other one (MESOS-10089)?

I'm already watching MESOS-9556, so if I have any new suggestions or the ticket has any new information I will post there.

Thanks.

> Mesos Master doesn't send AGENT_REMOVED when removing agent from internal state
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-10068
>                 URL: https://issues.apache.org/jira/browse/MESOS-10068
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.3, 1.8.2, 1.9.1
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Major
>         Attachments: master-full-logs.log
>
>
> Hello,
>
> Looking at the documentation of the master {{/api/v1}} endpoint, the {{SUBSCRIBE}} message says that only {{TASK_ADDED}} and {{TASK_UPDATED}} are supported for this endpoint, but when a new agent joins the cluster an {{AGENT_ADDED}} event is received.
> The problem is that when this agent is stopped, the {{AGENT_REMOVED}} event is not received by clients subscribed to the master API.
>
> I tested this behavior with versions {{1.7.3}}, {{1.8.2}} and {{1.9.1}}, all using the Docker image {{mesos/mesos-centos}}.
> The only time I saw an {{AGENT_REMOVED}} event was when a new agent joined the cluster but the master couldn't communicate with this agent. In this specific test there was a firewall blocking port {{5051}} on the slave, that is, nobody was able to talk to the slave on port {{5051}}.
>
> h2. Here are the steps to reproduce the problem
> * Start a new Mesos master;
> * Connect to the {{/api/v1}} endpoint, sending a {{SUBSCRIBE}} message:
> ** 
> {noformat}
> curl --no-buffer -Ld '{"type": "SUBSCRIBE"}' -H "Content-Type: application/json" http://MASTER_IP:5050/api/v1{noformat}
> * Start a new slave and confirm the {{AGENT_ADDED}} event is delivered;
> * Stop this slave;
> * Check that {{/slaves?slave_id=AGENT_ID}} returns a JSON response with the field {{active=false}};
> * Wait for the Mesos master to stop listing this slave, that is, until {{/slaves?slave_id=AGENT_ID}} returns an empty response.
> Even after the empty response, the event never reaches the subscriber.
>
> The Mesos master logs show this:
> {noformat}
> I1213 15:03:10.338935 13 master.cpp:1297] Agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964) disconnected
> I1213 15:03:10.339089 13 master.cpp:3399] Disconnecting agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> I1213 15:03:10.339207 13 master.cpp:3418] Deactivating agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 at slave(1)@172.18.0.51:5051 (86813ca2a964)
> {noformat}
> And then:
> {noformat}
> W1213 15:04:40.726670 15 process.cpp:1917] Failed to send 'mesos.internal.PingSlaveMessage' to '172.18.0.51:5051', connect: Failed to connect to 172.18.0.51:5051: No route to host{noformat}
> And some time after this:
> {noformat}
> I1213 15:04:37.685007 7 hierarchical.cpp:900] Removed agent 2cd23025-c09d-401b-8f26-9265eda8f800-S1 {noformat}
>
> Even after this removal, the {{AGENT_REMOVED}} event is not delivered.
>
> I will attach the full master logs also.
>
> Do you think this could be a bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
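
For anyone repeating the reproduction steps, the subscriber side can be checked with a small script instead of reading the raw curl output. The sketch below assumes the {{/api/v1}} event stream is framed as RecordIO (a decimal byte length, a newline, then that many bytes of JSON), as the Mesos operator API documentation describes. The helper names {{frame}}, {{iter_recordio}} and {{agent_events}} are hypothetical, and the sample events are synthetic stand-ins, not output captured from a real master.

```python
import json
from typing import Iterable, Iterator


def frame(event: dict) -> bytes:
    """Encode one event as a RecordIO record (only used to build test data)."""
    body = json.dumps(event).encode("utf-8")
    return str(len(body)).encode("ascii") + b"\n" + body


def iter_recordio(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield decoded JSON events from byte chunks of a RecordIO stream.

    Chunks may split a record anywhere, so data is buffered until both the
    length prefix and the full record body have arrived.
    """
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            newline = buf.find(b"\n")
            if newline < 0:
                break  # length prefix is not complete yet
            length = int(buf[:newline])
            start = newline + 1
            if len(buf) - start < length:
                break  # record body is not complete yet
            yield json.loads(buf[start:start + length])
            buf = buf[start + length:]


def agent_events(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield only the agent lifecycle event types seen on the stream."""
    for event in iter_recordio(chunks):
        if event.get("type") in ("AGENT_ADDED", "AGENT_REMOVED"):
            yield event["type"]


if __name__ == "__main__":
    # Synthetic stream: what a subscriber would see if both events arrived.
    sample = [{"type": "SUBSCRIBED"}, {"type": "AGENT_ADDED"},
              {"type": "AGENT_REMOVED"}]
    stream = b"".join(frame(e) for e in sample)
    for name in agent_events([stream]):
        print(name)
```

Against a live master, the chunks would come from the response body of the {{SUBSCRIBE}} request shown in the steps above. With the behavior reported in this ticket, {{AGENT_ADDED}} would be yielded but {{AGENT_REMOVED}} would not appear after the agent is stopped.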