mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinod Kone <vinodk...@gmail.com>
Subject Re: mesos slave sometimes does not send launchTask to executor?
Date Wed, 23 Jul 2014 19:43:39 GMT
There were no bugs/fixes related to slave<->executor message passing in
0.19.1. So, I'm surprised you weren't able to repro this behavior in 0.19.1.

Looks like you are using pure scheduler/executor clients to talk to mesos?
If yes, there might be a bug in your executor? Although, if it's in your
executor that should be reproed in 0.19.1 as well.

I would recommend running the default command executor to see if you are
having issues with message passing in 0.19.0. If no issues, I would dig
more closely into your executor client.


On Wed, Jul 23, 2014 at 12:17 PM, Henning Schmiedehausen <
henning@schmiedehausen.org> wrote:

> Hi,
>
> so I have that strange behavior that once in a blue moon, the mesos
> executor decides to not send a "launchTask" message to my executor.
>
> This is what I see (mesos-slave.INFO):
>
> I0722 02:09:34.524090  6769 slave.cpp:933] Got assigned task
> 83ea269c-8988-49bd-9d23-034c33858352 for framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:34.524222  6769 slave.cpp:1043] Launching task
> 83ea269c-8988-49bd-9d23-034c33858352 for framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:34.524915  6769 slave.cpp:1153] Queuing task
> '83ea269c-8988-49bd-9d23-034c33858352' for executor
> candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4 of framework
> '20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:34.524968  6772 mesos_containerizer.cpp:537] Starting
> container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' for executor
> 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
> '20140711-183251-3078776074-5050-5240-0083'
> I0722 02:09:34.527037  6772 launcher.cpp:117] Forked child with pid
> '428' for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
> I0722 02:09:34.527452  6772 mesos_containerizer.cpp:647] Fetching URIs
> for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' using command
> '/usr/local/libexec/mesos/mesos-fetcher'
> I0722 02:09:35.237933  6767 slave.cpp:2355] Monitoring executor
> 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
> '20140711-183251-3078776074-5050-5240-0083' in container
> 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
> I0722 02:09:36.131356  6767 slave.cpp:1664] Got registration for
> executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:36.131597  6767 slave.cpp:1783] Flushing queued task
> 83ea269c-8988-49bd-9d23-034c33858352 for executor
> 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
> 20140711-183251-3078776074-5050-5240-0083
>
>
> For a working task, it looks like this:
>
> I0722 02:08:34.484169  6773 slave.cpp:933] Got assigned task
> c448be94-cd7f-4623-8348-21cf95651ee8 for framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:34.484282  6773 slave.cpp:1043] Launching task
> c448be94-cd7f-4623-8348-21cf95651ee8 for framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:34.486240  6774 mesos_containerizer.cpp:537] Starting
> container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' for executor
> 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
> '20140711-183251-3078776074-5050-5240-0083'
> I0722 02:08:34.487082  6774 launcher.cpp:117] Forked child with pid
> '32644' for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
> I0722 02:08:34.487540  6774 mesos_containerizer.cpp:647] Fetching URIs
> for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' using command
> '/usr/local/libexec/mesos/mesos-fetcher'
> I0722 02:08:34.489589  6773 slave.cpp:1153] Queuing task
> 'c448be94-cd7f-4623-8348-21cf95651ee8' for executor
> candyland_e8741342-35af-407b-9986-fef48f432ae8 of framework
> '20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:35.193210  6771 slave.cpp:2355] Monitoring executor
> 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
> '20140711-183251-3078776074-5050-5240-0083' in container
> '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
> I0722 02:08:35.970985  6768 slave.cpp:1664] Got registration for
> executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:35.971294  6768 slave.cpp:1783] Flushing queued task
> c448be94-cd7f-4623-8348-21cf95651ee8 for executor
> 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:36.099797  6771 slave.cpp:2018] Handling status update
> TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 from
> 3f909650-d449-4a56-8b54-a43d102995c4@10.101.130.183:37954
> I0722 02:08:36.099890  6771 status_update_manager.cpp:320] Received
> status update TASK_STARTING (UUID:
> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:36.100342  6771 status_update_manager.cpp:373] Forwarding
> status update TASK_STARTING (UUID:
> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 to
> master@10.101.130.183:5050
> I0722 02:08:36.100461  6771 slave.cpp:2145] Sending acknowledgement
> for status update TASK_STARTING (UUID:
> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 to
> 3f909650-d449-4a56-8b54-a43d102995c4@10.101.130.183:37954
> I0722 02:08:36.103785  6769 status_update_manager.cpp:398] Received
> status update acknowledgement (UUID:
> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:36.158952  6769 slave.cpp:2018] Handling status update
> TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 from
> 3f909650-d449-4a56-8b54-a43d102995c4@10.101.130.183:37954
> I0722 02:08:36.159203  6769 status_update_manager.cpp:320] Received
> status update TASK_RUNNING (UUID:
> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:08:36.159416  6769 status_update_manager.cpp:373] Forwarding
> status update TASK_RUNNING (UUID:
> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 to
> master@10.101.130.183:5050
> I0722 02:08:36.159515  6769 slave.cpp:2145] Sending acknowledgement
> for status update TASK_RUNNING (UUID:
> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 to
> 3f909650-d449-4a56-8b54-a43d102995c4@10.101.130.183:37954
> I0722 02:08:36.162083  6774 status_update_manager.cpp:398] Received
> status update acknowledgement (UUID:
> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:40.064846  6772 slave.cpp:2018] Handling status update
> TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 from
> 3f909650-d449-4a56-8b54-a43d102995c4@10.101.130.183:37954
> I0722 02:09:40.065062  6772 status_update_manager.cpp:320] Received
> status update TASK_FINISHED (UUID:
> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:40.065203  6772 status_update_manager.cpp:373] Forwarding
> status update TASK_FINISHED (UUID:
> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 to
> master@10.101.130.183:5050
> I0722 02:09:40.065286  6772 slave.cpp:2145] Sending acknowledgement
> for status update TASK_FINISHED (UUID:
> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083 to
> 3f909650-d449-4a56-8b54-a43d102995c4@10.101.130.183:37954
> I0722 02:09:40.068207  6768 status_update_manager.cpp:398] Received
> status update acknowledgement (UUID:
> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:40.241500  6768 mesos_containerizer.cpp:1108] Executor for
> container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' has exited
> I0722 02:09:40.243710  6768 mesos_containerizer.cpp:992] Destroying
> container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
>
>
> On the executor side, for the failed task, I see
>
> 02:09:35.509 [main] INFO  c.g.c.mesos.CandylandExecutor - ***
> Candyland Executor Main running
> 02:09:35.620 [main] INFO  c.g.c.mesos.CandylandExecutor - *** Entering
> Driver loop
> 02:09:35.663 [main] INFO  org.xnio - XNIO version 3.2.0.Final
> 02:09:35.754 [main] INFO  org.xnio.nio - XNIO NIO Implementation
> Version 3.2.0.Final
> 02:09:35.894 [executor-driver-0] DEBUG
> c.g.mesos.util.HttpProtocolSender - Sending to URL
> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
> :
> framework_id {
>   value: "20140711-183251-3078776074-5050-5240-0083"
> }
> executor_id {
>   value: "candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4"
> }
>
> 02:09:36.146 [OkHttp
> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
> ]
> DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
> :
> 202
>
> but nothing else. For the succeeding task, I see
>
> 02:08:35.437 [main] INFO  c.g.c.mesos.CandylandExecutor - ***
> Candyland Executor Main running
> 02:08:35.571 [main] INFO  c.g.c.mesos.CandylandExecutor - *** Entering
> Driver loop
> 02:08:35.614 [main] INFO  org.xnio - XNIO version 3.2.0.Final
> 02:08:35.673 [main] INFO  org.xnio.nio - XNIO NIO Implementation
> Version 3.2.0.Final
> 02:08:35.820 [executor-driver-0] DEBUG
> c.g.mesos.util.HttpProtocolSender - Sending to URL
> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
> :
> framework_id {
>   value: "20140711-183251-3078776074-5050-5240-0083"
> }
> executor_id {
>   value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
> }
>
> 02:08:35.979 [OkHttp
> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
> ]
> DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
> :
> 202
> 02:08:36.062 [executor-callback-0] INFO  c.g.c.mesos.CandylandExecutor
> - *** Launch Task, driver is
> com.groupon.mesos.JesosExecutorDriver@10ae9677
> 02:08:36.075 [executor-callback-1] INFO  c.g.c.mesos.CandylandExecutor
> - *** Registered, driver is
> com.groupon.mesos.JesosExecutorDriver@10ae9677
> 02:08:36.096 [executor-driver-0] DEBUG
> c.g.mesos.util.HttpProtocolSender - Sending to URL
> http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage:
> update {
>   framework_id {
>     value: "20140711-183251-3078776074-5050-5240-0083"
>   }
>   executor_id {
>     value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
>   }
>   slave_id {
>     value: "20140711-183251-3078776074-5050-5240-0"
>   }
>   status {
>     task_id {
>       value: "c448be94-cd7f-4623-8348-21cf95651ee8"
>     }
>     state: TASK_STARTING
>     timestamp: 1.405994916E9
>   }
>   timestamp: 1.405994916E9
>   uuid: "\001\226\302}qiN-\225\264,\325\a\nh\365"
> }
> pid: "3f909650-d449-4a56-8b54-a43d102995c4@vagrant-7028b966.local:37954"
>
> 02:08:36.100 [OkHttp
> http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage]
> DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
> http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage:
> 202
>
>
> For a successful task, my executor receives a launchTask message right
> after the registerExecutor message. For the failing one, I never
> receive launchTask from the slave. And the task hangs around in
> STARTING forever.
>
> I can somewhat reliably reproduce this with mesos 0.19.0 running in
> vagrant (it will show up running my test within an hour). Running
> mesos 0.19.1 on bare metal, I could not reproduce it in many hours of
> starting and stopping tasks, so it might be gone in 0.19.1.
>
> If that behaviour is to be expected (with messages being unreliable),
> should I more aggressively try to reconcile these starting tasks?
> Right now, I only reconcile tasks that at least returned a single
> status update message from the master.
>
> The only data point that I have is that for a failed taks, I see
> additional messages from other tasks interspersed in the log. For a
> succeeding task, there is no other task doing updates. This is what I
> saw interspersed with the log messages of the failed task.
>
> I0722 02:09:35.071388  6767 slave.cpp:1664] Got registration for
> executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:35.071637  6767 slave.cpp:1783] Flushing queued task
> dc5677b3-5b3b-4c3d-b750-19b5edea42eb for executor
> 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:35.237853  6771 mesos_containerizer.cpp:1108] Executor for
> container '97af22bb-cc60-4e9b-93ca-22fab235fd39' has exited
> I0722 02:09:35.240733  6771 mesos_containerizer.cpp:992] Destroying
> container '97af22bb-cc60-4e9b-93ca-22fab235fd39'
> I0722 02:09:35.251838  6769 slave.cpp:2413] Executor
> 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework
> 20140711-183251-3078776074-5050-5240-0083 has exited with status 0
> I0722 02:09:35.252343  6769 slave.cpp:2552] Cleaning up executor
> 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework
> 20140711-183251-3078776074-5050-5240-0083
> I0722 02:09:35.252504  6769 gc.cpp:56] Scheduling
>
> '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5/runs/97af22bb-cc60-4e9b-93ca-22fab235fd39'
> for gc 6.99999707835556days in the future
> I0722 02:09:35.252594  6769 gc.cpp:56] Scheduling
>
> '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5'
> for gc 6.9999970780563days in the future
>
> Sorry for being somewhat vague, I am chasing this bug for while now
> and I would like to see whether this is me not understanding what is
> going on or an actual bug.
>
> Thanks,
>     Henning
>
> (I can make the candyland code available if that helps but it is
> really ugly test hacking code and it shows how little I know about
> Mesos. ;-) )
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message