mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Mahler <benjamin.mah...@gmail.com>
Subject Re: Sometimes finished tasks get 'lost'
Date Mon, 05 Aug 2013 23:33:24 GMT
Created https://issues.apache.org/jira/browse/MESOS-616 to directly address
this.


On Mon, Aug 5, 2013 at 3:40 PM, Vinod Kone <vinodkone@gmail.com> wrote:

> Hey Brenden,
>
> The "ignoring registration timeout" log message is innocuous (I'll probably
> change its wording one of these days). I think there is an outstanding bug
> where a framework moved to "completedFrameworks" is not moved back to
> "frameworks" when the framework with same id is seen by mesos.
>
>
> On Mon, Aug 5, 2013 at 1:08 PM, Brenden Matthews <
> brenden.matthews@airbedandbreakfast.com> wrote:
>
> > Hey folks,
> >
> > Occassionally, a task runs normally and finishes, but the slave ignores
> the
> > registration timeout because it thinks the framework has exited.  In my
> > example below, the task is "echo hi" in chronos.  I've also observed this
> > with other frameworks, such as Hadoop.  This is a problem for chronos
> tasks
> > in particular, because after this occurs I'm unable to retrieve the task
> > logs via the Mesos web UI.  These task logs are very helpful for
> debugging
> > chronos jobs.
> >
> > Here's the log from the master followed by the slave log:
> >
> > I0805 19:36:06.977514 16379 master.hpp:311] Adding task
> > > ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024;
> > > disk(*):1024 on slave 201308022130-200446986-5050-11821-0
> > > I0805 19:36:06.981220 16379 master.cpp:1718] Launching task
> > > ct:1375731364150:0:level2 of framework chronos with resources
> cpus(*):1;
> > > mem(*):1024; disk(*):1024 on slave 201308022130-200446986-5050-11821-0
> > > (i-c9b911b0)
> > > I0805 19:36:07.108693 16373 master.cpp:1130] Status update from
> slave(1)@
> > > 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework
> chronos
> > > is now in state TASK_RUNNING
> > > I0805 19:36:07.146131 16369 master.cpp:1130] Status update from
> slave(1)@
> > > 10.148.164.186:5051: task ct:1375731364150:0:level3 of framework
> chronos
> > > is now in state TASK_FINISHED
> > > I0805 19:36:07.146595 16369 master.hpp:323] Removing task
> > > ct:1375731364150:0:level3 with resources cpus(*):1; mem(*):1024;
> > > disk(*):1024 on slave 201308022130-200446986-5050-11821-1
> > > I0805 19:36:07.147125 16369 hierarchical_allocator_process.hpp:595]
> > > Recovered cpus(*):1; mem(*):1024; disk(*):1024 (total allocatable:
> > > cpus(*):1.4; mem(*):1275; disk(*):610649; ports(*):[31002-32000]) on
> > slave
> > > 201308022130-200446986-5050-11821-1 from framework chronos
> > > I0805 19:36:07.971992 16363 master.cpp:1366] Sending 1 offers to
> > framework
> > > chronos
> > > I0805 19:36:07.973424 16374 master.cpp:1599] Processing reply for offer
> > > 201308051833-200446986-5050-16348-5068 on slave
> > > 201308022130-200446986-5050-11821-1 (i-d7b911ae) for framework chronos
> > > I0805 19:36:07.973973 16367 hierarchical_allocator_process.hpp:548]
> > > Framework chronos filtered slave 201308022130-200446986-5050-11821-1
> for
> > > 5secs
> > > I0805 19:36:08.102439 16364 master.cpp:1130] Status update from
> slave(1)@
> > > 10.148.197.225:5051: task ct:1375731364150:0:level2 of framework
> chronos
> > > is now in state TASK_FINISHED
> > > I0805 19:36:08.102907 16364 master.hpp:323] Removing task
> > > ct:1375731364150:0:level2 with resources cpus(*):1; mem(*):1024;
> > > disk(*):1024 on slave 201308022130-200446986-5050-11821-0
> >
> >
> >
> > I0805 19:36:06.982079 14728 slave.cpp:747] Got assigned task
> > > ct:1375731364150:0:level2 for framework chronos
> > > I0805 19:36:06.983682 14728 gc.cpp:84] Unscheduling
> > >
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos'
> > > for removal
> > > I0805 19:36:06.984241 14746 slave.cpp:845] Launching task
> > > ct:1375731364150:0:level2 for framework chronos
> > > I0805 19:36:06.986026 14746 paths.hpp:303] Created executor directory
> > >
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136
> > > 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> > > I0805 19:36:06.986639 14746 slave.cpp:956] Queuing task
> > > 'ct:1375731364150:0:level2' for executor ct:1375731364150:0:level2 of
> > > framework 'chronos
> > > I0805 19:36:06.987416 14749 cgroups_isolator.cpp:513] Launching
> > > ct:1375731364150:0:level2 (/usr/local/libexec/mesos/mesos-executor) in
> > > /tmp/mesos/slaves/201308022130-
> > >
> >
> 200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > with resources  for framework chronos in cgrou
> > > p
> > >
> >
> mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > I0805 19:36:06.988970 14746 slave.cpp:519] Successfully attached file
> > >
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:137573136
> > > 4150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> > > I0805 19:36:06.991384 14749 cgroups_isolator.cpp:656] Changing cgroup
> > > controls for executor ct:1375731364150:0:level2 of framework chronos
> with
> > > resources
> > > I0805 19:36:06.993352 14749 cgroups_isolator.cpp:1053] Started
> listening
> > > for OOM events for executor ct:1375731364150:0:level2 of framework
> > chronos
> > > I0805 19:36:06.995985 14749 cgroups_isolator.cpp:543] Forked executor
> at
> > =
> > > 25602
> > > I0805 19:36:07.099602 14739 slave.cpp:1423] Got registration for
> executor
> > > 'ct:1375731364150:0:level2' of framework chronos
> > > I0805 19:36:07.100589 14739 slave.cpp:1538] Flushing queued task
> > > ct:1375731364150:0:level2 for executor 'ct:1375731364150:0:level2' of
> > > framework chronos
> > > I0805 19:36:07.100688 14738 cgroups_isolator.cpp:656] Changing cgroup
> > > controls for executor ct:1375731364150:0:level2 of framework chronos
> with
> > > resources cpus(*):1; m
> > > em(*):1024; disk(*):1024
> > > I0805 19:36:07.102645 14738 cgroups_isolator.cpp:964] Updated
> > > 'cpu.cfs_period_us' to 100000 and 'cpu.cfs_quota_us' to 100000 for
> > executor
> > > ct:1375731364150:0:level2 of
> > >  framework chronos
> > > I0805 19:36:07.103688 14738 cgroups_isolator.cpp:889] Updated
> > 'cpu.shares'
> > > to 1024 for executor ct:1375731364150:0:level2 of framework chronos
> > > I0805 19:36:07.104630 14738 cgroups_isolator.cpp:1027] Updated
> > > 'memory.soft_limit_in_bytes' to 1073741824 for executor
> > > ct:1375731364150:0:level2 of framework chronos
> > > I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update
> > > TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> > > ct:1375731364150:0:level2 of framework chronos from executor(1)@
> > > 10.148.197.225:45297
> > > I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received
> > status
> > > update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for
> task
> > > ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> > > I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating
> > > StatusUpdate stream for task ct:1375731364150:0:level2 of framework
> > chronos
> > > I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding
> > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> > for
> > > task ct:1375731364150:0:level2 of framework chronos to
> > > master@10.148.242.11:5050
> > > I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for
> > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> > for
> > > task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> > > 10.148.197.225:45297
> > > I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received
> > status
> > > update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task
> > > ct:1375731364150:0:level2 of framework chronos
> > > I0805 19:36:07.105954 14728 slave.cpp:1720] Handling status update
> > > TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for task
> > > ct:1375731364150:0:level2 of framework chronos from executor(1)@
> > > 10.148.197.225:45297
> > > I0805 19:36:07.106722 14741 status_update_manager.cpp:290] Received
> > status
> > > update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5) for
> task
> > > ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> > > I0805 19:36:07.107151 14741 status_update_manager.cpp:450] Creating
> > > StatusUpdate stream for task ct:1375731364150:0:level2 of framework
> > chronos
> > > I0805 19:36:07.107681 14741 status_update_manager.cpp:336] Forwarding
> > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> > for
> > > task ct:1375731364150:0:level2 of framework chronos to
> > > master@10.148.242.11:5050
> > > I0805 19:36:07.108211 14728 slave.cpp:1837] Sending acknowledgement for
> > > status update TASK_RUNNING (UUID: 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5)
> > for
> > > task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> > > 10.148.197.225:45297
> > > I0805 19:36:07.110834 14735 status_update_manager.cpp:360] Received
> > status
> > > update acknowledgement 9a06b87f-60e9-45f0-8c8d-8128e06e1fd5 for task
> > > ct:1375731364150:0:level2 of framework chronos
> > > I0805 19:36:08.100355 14741 slave.cpp:1720] Handling status update
> > > TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for task
> > > ct:1375731364150:0:level2 of framework chronos from executor(1)@
> > > 10.148.197.225:45297
> > > I0805 19:36:08.100966 14739 status_update_manager.cpp:290] Received
> > status
> > > update TASK_FINISHED (UUID: 284d4591-06b4-410c-b237-5d740a78ad88) for
> > task
> > > ct:1375731364150:0:level2 of framework chronos with checkpoint=false
> > > I0805 19:36:08.101403 14739 status_update_manager.cpp:336] Forwarding
> > > status update TASK_FINISHED (UUID:
> 284d4591-06b4-410c-b237-5d740a78ad88)
> > > for task ct:1375731364150:0:level2 of framework chronos to
> > > master@10.148.242.11:5050
> > > I0805 19:36:08.101419 14740 cgroups_isolator.cpp:656] Changing cgroup
> > > controls for executor ct:1375731364150:0:level2 of framework chronos
> with
> > > resources
> > > I0805 19:36:08.102355 14750 slave.cpp:1837] Sending acknowledgement for
> > > status update TASK_FINISHED (UUID:
> 284d4591-06b4-410c-b237-5d740a78ad88)
> > > for task ct:1375731364150:0:level2 of framework chronos to executor(1)@
> > > 10.148.197.225:45297
> > > I0805 19:36:08.114512 14727 status_update_manager.cpp:360] Received
> > status
> > > update acknowledgement 284d4591-06b4-410c-b237-5d740a78ad88 for task
> > > ct:1375731364150:0:level2 of framework chronos
> > > I0805 19:36:08.115033 14727 status_update_manager.cpp:481] Cleaning up
> > > status update stream for task ct:1375731364150:0:level2 of framework
> > chronos
> > > I0805 19:36:09.707617 14741 cgroups_isolator.cpp:851] Executor
> > > ct:1375731364150:0:level2 of framework chronos terminated with status 0
> > > I0805 19:36:09.708048 14741 cgroups_isolator.cpp:619] Killing executor
> > > ct:1375731364150:0:level2 of framework chronos
> > > I0805 19:36:09.709691 14741 cgroups_isolator.cpp:1073] OOM notifier is
> > > triggered for executor ct:1375731364150:0:level2 of framework chronos
> > with
> > > uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > I0805 19:36:09.710186 14741 cgroups_isolator.cpp:1078] Discarded OOM
> > > notifier for executor ct:1375731364150:0:level2 of framework chronos
> with
> > > uuid 7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > I0805 19:36:09.715747 14741 cgroups.cpp:1193] Trying to freeze cgroup
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > I0805 19:36:09.718188 14741 cgroups.cpp:1232] Successfully froze cgroup
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > after 1 attempts
> > > I0805 19:36:09.724428 14731 cgroups.cpp:1208] Trying to thaw cgroup
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > I0805 19:36:09.724879 14731 cgroups.cpp:1318] Successfully thawed
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > I0805 19:36:09.882351 14740 cgroups_isolator.cpp:1192] Successfully
> > > destroyed cgroup
> > >
> >
> mesos/framework_chronos_executor_ct:1375731364150:0:level2_tag_7869f30a-7700-4cd7-8bb4-4a8ed618b5ce
> > > I0805 19:36:09.884524 14722 slave.cpp:2088] Executor
> > > 'ct:1375731364150:0:level2' of framework chronos has exited with
> status 0
> > > I0805 19:36:09.885027 14722 slave.cpp:2223] Cleaning up executor
> > > 'ct:1375731364150:0:level2' of framework chronos
> > > I0805 19:36:09.889178 14740 gc.cpp:56] Scheduling
> > >
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2/runs/7869f30a-7700-4cd7-8bb4-4a8ed618b5ce'
> > > for removal
> > > I0805 19:36:09.889292 14722 slave.cpp:2287] Cleaning up framework
> chronos
> > > I0805 19:36:09.889802 14740 gc.cpp:56] Scheduling
> > >
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos/executors/ct:1375731364150:0:level2'
> > > for removal
> > > I0805 19:36:09.890043 14728 status_update_manager.cpp:252] Closing
> status
> > > update streams for framework chronos
> > > I0805 19:36:09.890599 14740 gc.cpp:56] Scheduling
> > >
> >
> '/tmp/mesos/slaves/201308022130-200446986-5050-11821-0/frameworks/chronos'
> > > for removal
> > > I0805 19:36:38.089031 14736 slave.cpp:2539] Current usage 0.47%. Max
> > > allowed age: 1.790501830619676days
> > > I0805 19:37:01.998739 14748 slave.cpp:2446] Framework chronos seems to
> > > have exited. Ignoring registration timeout for executor
> > > 'ct:1375731357036:0:level3'
> > > I0805 19:37:04.006137 14735 slave.cpp:2446] Framework chronos seems to
> > > have exited. Ignoring registration timeout for executor
> > > 'ct:1375731357119:0:echo hi'
> > > I0805 19:37:06.987412 14724 slave.cpp:2446] Framework chronos seems to
> > > have exited. Ignoring registration timeout for executor
> > > 'ct:1375731364150:0:level2'
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message