mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vinod Kone <vinodk...@gmail.com>
Subject Re: Sometimes finished tasks get 'lost'
Date Mon, 05 Aug 2013 22:40:37 GMT
Hey Brenden,

The "ignoring registration timeout" log message is innocuous (I'll probably
change its wording one of these days). I think there is an outstanding bug
where a framework moved to "completedFrameworks" is not moved back to
"frameworks" when the framework with same id is seen by mesos.


On Mon, Aug 5, 2013 at 1:08 PM, Brenden Matthews <
brenden.matthews@airbedandbreakfast.com> wrote:

> Hey folks,
>
> Occassionally, a task runs normally and finishes, but the slave ignores the
> registration timeout because it thinks the framework has exited.  In my
> example below, the task is "echo hi" in chronos.  I've also observed this
> with other frameworks, such as Hadoop.  This is a problem for chronos tasks
> in particular, because after this occurs I'm unable to retrieve the task
> logs via the Mesos web UI.  These task logs are very helpful for debugging
> chronos jobs.
>
> Here's the log from the master followed by the slave log:
>
> I0805 19:36:06.977514 16379 master.hpp:311] Adding task
> > ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024;
> > disk(*):1024 on slave 201308022130-200446986-5050-11821-0
> > I0805 19:36:06.981220 16379 master.cpp:1718] Launching task
> > ct:1375731364150:0:level2 of framework chronos with resources cpus(*):1;
> > mem(*):1024; disk(*):1024 on slave 201308022130-200446986-5050-11821-0
> > (i-c9b911b0)
> > I0805 19:36:07.108693 16373 master.cpp:1130] Status update from slave(1)@
> > 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework chronos
> > is now in state TASK_RUNNING
> > I0805 19:36:07.146131 16369 master.cpp:1130] Status update from slave(1)@
> > 10.148.164.186:5051: task ct:1375731364150:0:level3 of framework chronos
> > is now in state TASK_FINISHED
> > I0805 19:36:07.146595 16369 master.hpp:323] Removing task
> > ct:1375731364150:0:level3 with resources cpus(*):1; mem(*):1024;
> > disk(*):1024 on slave 201308022130-200446986-5050-11821-1
> > I0805 19:36:07.147125 16369 hierarchical_allocator_process.hpp:595]
> > Recovered cpus(*):1; mem(*):1024; disk(*):1024 (total allocatable:
> > cpus(*):1.4; mem(*):1275; disk(*):610649; ports(*):[31002-32000]) on
> slave
> > 201308022130-200446986-5050-11821-1 from framework chronos
> > I0805 19:36:07.971992 16363 master.cpp:1366] Sending 1 offers to
> framework
> > chronos
> > I0805 19:36:07.973424 16374 master.cpp:1599] Processing reply for offer
> > 201308051833-200446986-5050-16348-5068 on slave
> > 201308022130-200446986-5050-11821-1 (i-d7b911ae) for framework chronos
> > I0805 19:36:07.973973 16367 hierarchical_allocator_process.hpp:548]
> > Framework chronos filtered slave 201308022130-200446986-5050-11821-1 for
> > 5secs
> > I0805 19:36:08.102439 16364 master.cpp:1130] Status update from slave(1)@
> > 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework chronos
> > is now in state TASK_FINISHED
> > I0805 19:36:08.102907 16364 master.hpp:323] Removing task
> > ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024;
> > disk(*):1024 on slave 201308022130-200446986-5050-11821-0
>
>
>
> I0805 19:36:06.982079 14728 slave.cpp:747] Got assigned task
> > ct:1375731364150:0:level2 for framework chronos
> > I0805 19:36:06.983682 14728 gc.cpp:84] Unscheduling
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos'
> > for removal
> > I0805 19:36:06.984241 14746 slave.cpp:845] Launching task
> > ct:1375731364150:0:level2 for framework chronos
> > I0805 19:36:06.986026 14746 paths.hpp:303] Created executor directory
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136
> > 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> > I0805 19:36:06.986639 14746 slave.cpp:956] Queuing task
> > 'ct:1375731364150:0:level2' for executor ct:1375731364150:0:level2 of
> > framework 'chronos
> > I0805 19:36:06.987416 14749 cgroups_isolator.cpp:513] Launching
> > ct:1375731364150:0:level2 (/usr/local/libexec/mesos/mesos-executor) in
> > /tmp/mesos/slaves/201308022130-
> >
> 200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > with resources  for framework chronos in cgrou
> > p
> >
> mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > I0805 19:36:06.988970 14746 slave.cpp:519] Successfully attached file
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136
> > 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> > I0805 19:36:06.991384 14749 cgroups_isolator.cpp:656] Changing cgroup
> > controls for executor ct:1375731364150:0:level2 of framework chronos with
> > resources
> > I0805 19:36:06.993352 14749 cgroups_isolator.cpp:1053] Started listening
> > for OOM events for executor ct:1375731364150:0:level2 of framework
> chronos
> > I0805 19:36:06.995985 14749 cgroups_isolator.cpp:543] Forked executor at
> =
> > 25602
> > I0805 19:36:07.099602 14739 slave.cpp:1423] Got registration for executor
> > 'ct:1375731364150:0:level2' of framework chronos
> > I0805 19:36:07.100589 14739 slave.cpp:1538] Flushing queued task
> > ct:1375731364150:0:level2 for executor 'ct:1375731364150:0:level2' of
> > framework chronos
> > I0805 19:36:07.100688 14738 cgroups_isolator.cpp:656] Changing cgroup
> > controls for executor ct:1375731364150:0:level2 of framework chronos with
> > resources cpus(*):1; m
> > em(*):1024; disk(*):1024
> > I0805 19:36:07.102645 14738 cgroups_isolator.cpp:964] Updated
> > 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 100000 for
> executor
> > ct:1375731364150:0:level2 of
> >  framework chronos
> > I0805 19:36:07.103688 14738 cgroups_isolator.cpp:889] Updated
> 'cpu.shares'
> > to 1024 for executor ct:1375731364150:0:level2 of framework chronos
> > I0805 19:36:07.104630 14738 cgroups_isolator.cpp:1027] Updated
> > 'memory.soft_limit_in_bytes' to 1073741824 for executor
> > ct:1375731364150:0:level2 of framework chronos
> > I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update
> > TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> > ct:1375731364150:0:level2 of framework chronos from executor(1)@
> > 10.148.197.225:45297
> > I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received
> status
> > update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> > ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> > I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating
> > StatusUpdate stream for task ct:1375731364150:0:level2 of framework
> chronos
> > I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding
> > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> for
> > task ct:1375731364150:0:level2 of framework chronos to
> > master@10.148.242.11:5050
> > I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for
> > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> for
> > task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> > 10.148.197.225:45297
> > I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received
> status
> > update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task
> > ct:1375731364150:0:level2 of framework chronos
> > I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update
> > TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> > ct:1375731364150:0:level2 of framework chronos from executor(1)@
> > 10.148.197.225:45297
> > I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received
> status
> > update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> > ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> > I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating
> > StatusUpdate stream for task ct:1375731364150:0:level2 of framework
> chronos
> > I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding
> > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> for
> > task ct:1375731364150:0:level2 of framework chronos to
> > master@10.148.242.11:5050
> > I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for
> > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> for
> > task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> > 10.148.197.225:45297
> > I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received
> status
> > update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task
> > ct:1375731364150:0:level2 of framework chronos
> > I0805 19:36:08.100355 14741 slave.cpp:1720] Handling status update
> > TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for task
> > ct:1375731364150:0:level2 of framework chronos from executor(1)@
> > 10.148.197.225:45297
> > I0805 19:36:08.100966 14739 status_update_manager.cpp:290] Received
> status
> > update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for
> task
> > ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> > I0805 19:36:08.101403 14739 status_update_manager.cpp:336] Forwarding
> > status update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88)
> > for task ct:1375731364150:0:level2 of framework chronos to
> > master@10.148.242.11:5050
> > I0805 19:36:08.101419 14740 cgroups_isolator.cpp:656] Changing cgroup
> > controls for executor ct:1375731364150:0:level2 of framework chronos with
> > resources
> > I0805 19:36:08.102355 14750 slave.cpp:1837] Sending acknowledgement for
> > status update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88)
> > for task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> > 10.148.197.225:45297
> > I0805 19:36:08.114512 14727 status_update_manager.cpp:360] Received
> status
> > update acknowledgement 284d4591-06b4-410c-b237-5d740a78ad88 for task
> > ct:1375731364150:0:level2 of framework chronos
> > I0805 19:36:08.115033 14727 status_update_manager.cpp:481] Cleaning up
> > status update stream for task ct:1375731364150:0:level2 of framework
> chronos
> > I0805 19:36:09.707617 14741 cgroups_isolator.cpp:851] Executor
> > ct:1375731364150:0:level2 of framework chronos terminated with status 0
> > I0805 19:36:09.708048 14741 cgroups_isolator.cpp:619] Killing executor
> > ct:1375731364150:0:level2 of framework chronos
> > I0805 19:36:09.709691 14741 cgroups_isolator.cpp:1073] OOM notifier is
> > triggered for executor ct:1375731364150:0:level2 of framework chronos
> with
> > uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > I0805 19:36:09.710186 14741 cgroups_isolator.cpp:1078] Discarded OOM
> > notifier for executor ct:1375731364150:0:level2 of framework chronos with
> > uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > I0805 19:36:09.715747 14741 cgroups.cpp:1193] Trying to freeze cgroup
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > I0805 19:36:09.718188 14741 cgroups.cpp:1232] Successfully froze cgroup
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > after 1 attempts
> > I0805 19:36:09.724428 14731 cgroups.cpp:1208] Trying to thaw cgroup
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > I0805 19:36:09.724879 14731 cgroups.cpp:1318] Successfully thawed
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > I0805 19:36:09.882351 14740 cgroups_isolator.cpp:1192] Successfully
> > destroyed cgroup
> >
> mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > I0805 19:36:09.884524 14722 slave.cpp:2088] Executor
> > 'ct:1375731364150:0:level2' of framework chronos has exited with status 0
> > I0805 19:36:09.885027 14722 slave.cpp:2223] Cleaning up executor
> > 'ct:1375731364150:0:level2' of framework chronos
> > I0805 19:36:09.889178 14740 gc.cpp:56] Scheduling
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> > for removal
> > I0805 19:36:09.889292 14722 slave.cpp:2287] Cleaning up framework chronos
> > I0805 19:36:09.889802 14740 gc.cpp:56] Scheduling
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2'
> > for removal
> > I0805 19:36:09.890043 14728 status_update_manager.cpp:252] Closing status
> > update streams for framework chronos
> > I0805 19:36:09.890599 14740 gc.cpp:56] Scheduling
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos'
> > for removal
> > I0805 19:36:38.089031 14736 slave.cpp:2539] Current usage 0.47%. Max
> > allowed age: 1.790501830619676days
> > I0805 19:37:01.998739 14748 slave.cpp:2446] Framework chronos seems to
> > have exited. Ignoring registration timeout for executor
> > 'ct:1375731357036:0:level3'
> > I0805 19:37:04.006137 14735 slave.cpp:2446] Framework chronos seems to
> > have exited. Ignoring registration timeout for executor
> > 'ct:1375731357119:0:echo hi'
> > I0805 19:37:06.987412 14724 slave.cpp:2446] Framework chronos seems to
> > have exited. Ignoring registration timeout for executor
> > 'ct:1375731364150:0:level2'
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message