mesos-issues mailing list archives

From "Jian Qiu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-4679) slave dies unexpectedly: Mismatched checkpoint value for status update TASK_LOST
Date Tue, 25 Oct 2016 08:32:59 GMT

    [ https://issues.apache.org/jira/browse/MESOS-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604648#comment-15604648 ]

Jian Qiu commented on MESOS-4679:
---------------------------------

This still seems to be an issue in 1.0.0 when using k8s on Mesos.

 

> slave dies unexpectedly: Mismatched checkpoint value for status update TASK_LOST
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-4679
>                 URL: https://issues.apache.org/jira/browse/MESOS-4679
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.26.0
>            Reporter: James DeFelice
>              Labels: mesosphere
>
> It looks like the custom executor is sending out multiple terminal status updates for a specific task, and that is crashing the slave (it may also be mishandling status-update UUIDs). In any event, I think that the slave should handle this case with a bit more aplomb.
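
The duplicate terminal updates described in the report could, in principle, also be suppressed on the executor side. The following is only a minimal sketch of that idea, not the k8sm executor's actual code and not the slave-side fix the report asks for. `TerminalUpdateGuard` and `should_send` are hypothetical names, and the set of terminal states is an assumption based on the Mesos task states of that era.

```python
# Hypothetical executor-side guard: allow at most one terminal update per task.
# Assumed terminal states; adjust to the Mesos version in use.
TERMINAL_STATES = frozenset(
    {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST", "TASK_ERROR"}
)


class TerminalUpdateGuard:
    def __init__(self):
        # task_id -> first terminal state that was allowed through
        self._terminated = {}

    def should_send(self, task_id, state):
        """Return True if an update for task_id in this state may be sent."""
        if task_id in self._terminated:
            # The task already reached a terminal state; drop anything further.
            return False
        if state in TERMINAL_STATES:
            self._terminated[task_id] = state
        return True


# Replay the update sequence from the executor log in this report.
guard = TerminalUpdateGuard()
task = "pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f"
sequence = ["TASK_LOST", "TASK_KILLED", "TASK_LOST", "TASK_LOST"]
decisions = [guard.should_send(task, state) for state in sequence]
print(decisions)
```

With such a guard only the first TASK_LOST for the pod would leave the executor. The later TASK_KILLED and TASK_LOST updates for the same task would be dropped before reaching the slave.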
> Custom executor logs:
> {code}
> I0215 20:43:59.551657   11068 executor.go:426] Executor driver killTask
> I0215 20:43:59.551719   11068 executor.go:436] Executor driver is asked to kill task '&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],}'
> I0215 20:43:59.552189   11068 executor.go:687] Executor sending status update &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214 253 145 223 212 36 17 229 158 224 82 84 0 231 66 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.552599   11068 executor.go:687] Executor sending status update &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214 253 162 110 212 36 17 229 158 224 82 84 0 231 66 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.557376   11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.559077   11068 executor.go:445] Executor statusUpdateAcknowledgement
> I0215 20:43:59.559129   11068 executor.go:448] Receiving status update acknowledgement &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214 253 145 223 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562016   11068 executor.go:470] Executor driver received frameworkMessage
> I0215 20:43:59.562073   11068 executor.go:480] Executor driver receives framework message
> I0215 20:43:59.562100   11068 executor.go:445] Executor statusUpdateAcknowledgement
> I0215 20:43:59.562112   11068 executor.go:448] Receiving status update acknowledgement &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214 253 162 110 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562173   11068 executor.go:579] Receives message from framework task-lost:pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f
> I0215 20:43:59.562292   11068 executor.go:687] Executor sending status update &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*task-lost-ack,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214 255 28 217 212 36 17 229 158 224 82 84 0 231 66 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.562463   11068 executor.go:687] Executor sending status update &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214 255 35 27 212 36 17 229 158 224 82 84 0 231 66 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.568237   11068 executor.go:445] Executor statusUpdateAcknowledgement
> I0215 20:43:59.568286   11068 executor.go:448] Receiving status update acknowledgement &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214 255 28 217 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.588373   11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.588566   11068 executor.go:687] Executor sending status update &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6ce1b7db-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215 3 30 254 212 36 17 229 158 224 82 84 0 231 66 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.595983   11068 executor.go:260] slave disconnected, will wait for recovery
> I0215 20:43:59.596040   11068 executor.go:328] Slave is disconnected
> I0215 20:43:59.623678   11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.623841   11068 executor.go:687] Executor sending status update &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6d006a26-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215 8 128 159 212 36 17 229 158 224 82 84 0 231 66 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.624399   11068 executor.go:284] slave exited ... shutting down
> I0215 20:43:59.624442   11068 executor.go:613] Aborting the executor driver
> {code}
> Slave logs:
> {code}
> I0215 20:43:59.564084 15780 slave.cpp:2762] Handling status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 from executor(1)@10.2.0.6:40672
> W0215 20:43:59.564115 15780 slave.cpp:2856] Could not find the executor for status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.564321 15782 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.566783 15782 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.566879 15782 slave.cpp:3087] Forwarding the update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 to master@10.2.0.5:5050
> I0215 20:43:59.566952 15782 slave.cpp:3011] Sending acknowledgement for status update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 to executor(1)@10.2.0.6:40672
> F0215 20:43:59.567073 15782 slave.cpp:3003] CHECK_READY(future): is FAILED: Mismatched checkpoint value for status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 (expected checkpoint=true actual checkpoint=false) Failed to handle status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
> {code}
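
The executor log prints each status-update UUID as a Go byte slice, while the slave log prints it in canonical hex form, which makes the two logs awkward to correlate. A small helper along these lines may help when reading them side by side. `log_bytes_to_uuid` is a hypothetical name and the sketch uses only the Python standard library.

```python
import uuid


def log_bytes_to_uuid(byte_list):
    """Convert a Go-style byte slice such as '[214 255 35 27 ...]' to UUID text."""
    values = [int(token) for token in byte_list.strip("[] \n").split()]
    if len(values) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(values))
    return str(uuid.UUID(bytes=bytes(values)))


# The two TASK_LOST updates the executor sent at 20:43:59.5622xx and 20:43:59.5624xx.
task_lost_ack = log_bytes_to_uuid("[214 255 28 217 212 36 17 229 158 224 82 84 0 231 66 70]")
kill_pod_task = log_bytes_to_uuid("[214 255 35 27 212 36 17 229 158 224 82 84 0 231 66 70]")
print(task_lost_ack)
print(kill_pod_task)
```

Decoded this way, the `task-lost-ack` update corresponds to UUID d6ff1cd9-d424-11e5-9ee0-525400e74246, which the slave checkpoints, forwards and acknowledges. The second `kill-pod-task` update corresponds to d6ff231b-d424-11e5-9ee0-525400e74246, the UUID named in the fatal CHECK_READY failure.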



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
