mesos-user mailing list archives

From Greg Mann <g...@mesosphere.io>
Subject Re: Mesos master endlessly attempts to kill nonexistent task
Date Wed, 04 Apr 2018 17:30:34 GMT
Hi Adam,
The fact that the task does not show up in the Mesos UI doesn't make sense
to me, in light of the logs excerpts you included. The line:

Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441658 23602
master.cpp:5371] Telling agent 2215ab84-177b-478b-ab62-4453803fde6c-S6 at
slave(1)@10.99.50.3:5051 (zelda.service.domain.com) to kill task
pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef of
framework 346d7333-a980-43a8-93ab-343ea12d77d7-0000 (marathon) at
scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487

indicates that the Mesos master was able to locate this task in its
internal state. So, I would expect the task to show up in the Mesos UI. You
could also look for the task in the output of the GET_TASKS operator API
call for the master
<http://mesos.apache.org/documentation/latest/operator-http-api/#get_tasks>
and the agent
<http://mesos.apache.org/documentation/latest/operator-http-api/#get_tasks-1>.
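To make that concrete, here is one way to search a GET_TASKS response for the stuck task. This is a minimal sketch: the master address (10.99.50.2:5050) is an assumption inferred from the log excerpts in this thread, so adjust it to your cluster before running the network call.

```python
import json
from urllib import request

def find_task(state, needle):
    """Search a v1 GET_TASKS response for task IDs containing `needle`.

    The response groups tasks by lifecycle (pending, running, unreachable,
    completed, orphaned); return (group, task_id) pairs for each match.
    """
    groups = ("pending_tasks", "tasks", "unreachable_tasks",
              "completed_tasks", "orphan_tasks")
    found = []
    for group in groups:
        for t in state.get("get_tasks", {}).get(group, []):
            if needle in t["task_id"]["value"]:
                found.append((group, t["task_id"]["value"]))
    return found

def get_tasks(master_url):
    """POST {"type": "GET_TASKS"} to the master's v1 operator API."""
    req = request.Request(
        master_url + "/api/v1",
        data=json.dumps({"type": "GET_TASKS"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Against a live cluster (address is an assumption, see above):
#   state = get_tasks("http://10.99.50.2:5050")
# Here, a canned response with the shape the API returns:
sample = {
    "type": "GET_TASKS",
    "get_tasks": {
        "tasks": [
            {"task_id": {"value": "pub_api_oecd-rest-api-on-port-20015."
                                  "196f414a-f61f-11e7-856c-f6e84742f1ef"}}
        ]
    },
}
print(find_task(sample, "196f414a"))
```

If the task shows up under "tasks" on the master but not on the agent, that mismatch itself tells you where the state has diverged.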

Have you looked at the Mesos agent logs to see how the agent is responding
to the KILL calls?

Mesos doesn't store any state in ZK (it's only used for leader election),
so clearing the task there is not an option. It's possible that forcing a
leader election by restarting the current Mesos master may help, but I'm
uncertain what state the master is in currently, given the inconsistency
noted above.

Cheers,
Greg


On Wed, Apr 4, 2018 at 1:09 AM, Adam Cecile <adam.cecile@hitec.lu> wrote:

> For instance,
>
> No kill ack received for instance [pub_api_oecd-rest-api-on-
> port-20015.marathon-196f414a-f61f-11e7-856c-f6e84742f1ef], retrying
> (73402 attempts so far)
>
> I'd say after 73402 attempts, it's time to let it go :D
>
> On 04/04/2018 10:07 AM, Adam Cecile wrote:
>
> Hello list !
>
> The problem is still ongoing. Any hint on how to fix it? For example, by
> removing the broken app from ZooKeeper by hand?
>
> Regards, Adam.
>
> On 03/20/2018 06:04 PM, daemeon reiydelle wrote:
>
> I ran across a situation with the same symptoms last year (with Mesos &
> Marathon) when we had network problems. The Mesos task had in fact exited
> normally (we eventually found this in the logs), so its UUID had aged out.
>
>
> <======>
> "Who do you think made the first stone spear? The Asperger guy.
> If you get rid of the autism genetics, there would be no Silicon Valley"
> Temple Grandin
>
>
> *Daemeon C.M. Reiydelle San Francisco 1.415.501.0198 London 44 020 8144
> 9872*
>
>
> On Tue, Mar 20, 2018 at 1:34 AM, Adam Cecile <adam.cecile@hitec.lu> wrote:
>
>> Hi Greg,
>>
>> Yes, I can confirm. The log shows:
>>
>> No kill ack received for instance
>> [pub_api_oecd-rest-api-on-port-20015.marathon-196f414a-f61f-11e7-856c-f6e84742f1ef],
>> retrying (73402 attempts so far)
>>
>> and I cannot find this UUID in the Mesos interface.
>>
>> Regards, Adam.
>>
>> On 03/15/2018 05:47 PM, Greg Mann wrote:
>>
>> Hi Adam,
>> The KILL calls are being sent to Mesos by Marathon. Since the KILL call
>> is being forwarded to the agent, it seems that the Mesos master is aware of
>> the task. Could you verify that the tasks show up as running in the Mesos
>> UI? You say that the tasks don't exist anymore - how did you verify this?
>> If the tasks show up as running in the Mesos state, but the actual task
>> processes are not running on the agent, then it could indicate an issue
>> with the Mesos agent or executor.
>>
>> Cheers,
>> Greg
>>
>>
>> On Wed, Mar 14, 2018 at 1:59 AM, Adam Cecile <adam.cecile@hitec.lu>
>> wrote:
>>
>>> Hello,
>>>
>>> I see two old tasks being stuck in Mesos. These tasks don't exist
>>> anymore since ages but Mesos still tries to kill them:
>>>
>>>
>>> Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441572 23602
>>> master.cpp:5297] Processing KILL call for task
>>> 'pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef'
>>> of framework 346d7333-a980-43a8-93ab-343ea12d77d7-0000 (marathon) at
>>> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>>>
>>> Mar 14 09:56:49 mario mesos-master[23570]: I0314 09:56:49.441658 23602
>>> master.cpp:5371] Telling agent 2215ab84-177b-478b-ab62-4453803fde6c-S6
>>> at slave(1)@10.99.50.3:5051 (zelda.service.domain.com) to kill task
>>> pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef
>>> of framework 346d7333-a980-43a8-93ab-343ea12d77d7-0000 (marathon) at
>>> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>>>
>>> Mar 14 09:57:09 mario mesos-master[23570]: I0314 09:57:09.441529 23607
>>> master.cpp:5297] Processing KILL call for task
>>> 'pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef'
>>> of framework 346d7333-a980-43a8-93ab-343ea12d77d7-0000 (marathon) at
>>> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>>>
>>> Mar 14 09:57:09 mario mesos-master[23570]: I0314 09:57:09.441617 23607
>>> master.cpp:5371] Telling agent 2215ab84-177b-478b-ab62-4453803fde6c-S6
>>> at slave(1)@10.99.50.3:5051 (zelda.service.domain.com) to kill task
>>> pub_api_oecd-rest-api-on-port-20015.196f414a-f61f-11e7-856c-f6e84742f1ef
>>> of framework 346d7333-a980-43a8-93ab-343ea12d77d7-0000 (marathon) at
>>> scheduler-66a67553-0692-40b0-b29e-e7f342b6a241@10.99.50.2:40487
>>>
>>>
>>> Could you please tell me how to "purge" them from the Mesos master?
>>>
>>> Thanks in advance,
>>>
>>> Adam.
>>>
>>
>>
>>
>
>
>
