mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vinod Kone" <vinodk...@gmail.com>
Subject Re: Review Request 26382: Add source and reason to TaskStatus.
Date Mon, 03 Nov 2014 22:56:02 GMT


> On Nov. 3, 2014, 10:13 p.m., Vinod Kone wrote:
> > src/slave/slave.cpp, line 2952
> > <https://reviews.apache.org/r/26382/diff/11/?file=746449#file746449line2952>
> >
> >     So, after thinking a bit more and looking at MESOS-343, what we really want
here is to set the reason for failed as oom or memory limit (REASON_MEMORY_LIMIT). Slave generates
TASK_FAILED here not because the executor terminated but becaused it oomed (and in the future
due to disk limit etc). Note that the slave also generates TASK_FAILED when command executor
unexpectedly terminates. This is because, under normal conditions, command executor is not
expected to terminate until the task finishes. For this case I think REASON_INVALID_COMMAND
or REASON_BAD_COMMAND or REASON_COMMAND_FAILED is a better reason than REASON_EXECUTOR_TERMINATED
because users have no idea about executors when using command tasks. 
> >     
> >     The right way to do this is to plumb the reason via the Termination protobuf.
That way any isolator can include the right reason if its corresponding limit is reached (memory,
disk etc). If you want to punt on the plumbing, please add a TODO and tracking ticket.
> 
> Dominic Hamon wrote:
>     if termination is set to 'killed' then should it be TASK_KILLED instead of TASK_FAILED?
>     
>     I think this needs a TODO as it changes the plumbing quite a bit (and tests, etc).
For now, is REASON_EXECUTOR_TERMINATED accurate enough? I could add something for command
vs containerizer vs executor, but that starts to require more plumbing pretty quickly.
> 
> Vinod Kone wrote:
>     ```
>     if termination is set to 'killed' then should it be TASK_KILLED instead of TASK_FAILED?
>     ```
>     
>     No. It should be TASK_FAILED. Those are the semantics.
>     
>     ```
>     I think this needs a TODO as it changes the plumbing quite a bit (and tests, etc)
>     ```
>     
>     Which one? Plumbing via the termination protobuf or setting REASON_FAILED_COMMAND
and REASON_MEMORY_LIMIT? The latter could be done without any plumbing; all changes will be
in executorTerminated().
> 
> Dominic Hamon wrote:
>     this assumes that the termination reason is memory limit though, right?
>     
>     and we still don't have a reason for TASK_LOST.

```
this assumes that the termination reason is memory limit though, right?
```
thats correct, which is true as of today.

```
and we still don't have a reason for TASK_LOST.
```

TASK_LOST should be REASON_EXECUTOR_TERMINATED as you already had. That should only apply
for non-terminal tasks belonging to a terminated (non-oomed and non-command) executor.


- Vinod


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26382/#review59654
-----------------------------------------------------------


On Oct. 31, 2014, 10:09 p.m., Dominic Hamon wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/26382/
> -----------------------------------------------------------
> 
> (Updated Oct. 31, 2014, 10:09 p.m.)
> 
> 
> Review request for mesos, Vinod Kone and Bill Farner.
> 
> 
> Bugs: MESOS-1830 and MESOS-343
>     https://issues.apache.org/jira/browse/MESOS-1830
>     https://issues.apache.org/jira/browse/MESOS-343
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> Added source and reason, enabled TASK_ERROR, and made the changes necessary throughout
the codebase.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto 168a7a8c35ed1bf3f5bd6d7431b1e511bae7b789 
>   src/common/protobuf_utils.hpp 212d5124b9a4cc58e61719fa7f07a61cd166e834 
>   src/common/protobuf_utils.cpp a9b65e328c4c62bff7fbf5633dda25d742d79019 
>   src/examples/balloon_framework.cpp b05d5679fe2915142907af0b2dc00c6cd76eb9c1 
>   src/examples/java/TestFramework.java bc593d0abfacb00690b1492b2b82c970f4e4de6d 
>   src/examples/low_level_scheduler_libprocess.cpp 7ef5ea78ade4ed856b97009fdfe31281f0a55c17

>   src/examples/low_level_scheduler_pthread.cpp 6e233a10117a1c7aa669806b5b430e746e227ee5

>   src/examples/no_executor_framework.cpp f98a0735b9f287e7f1bf98af6c2e9a47ca6a77b2 
>   src/examples/test_framework.cpp 187a611ebfe35cb13ee48aa5eca934cf55f34dea 
>   src/master/master.cpp 762d2ff6c168ac212f70b43275692a77496a7fcd 
>   src/sched/sched.cpp 0fb8c7bda75545389f8024489b3c76ae115111f4 
>   src/slave/slave.cpp 96fb5f7385b0762d46d8129f7e43207bd6311644 
>   src/tests/fault_tolerance_tests.cpp a18a41a3e34ff112e04e693447d757403e5013bd 
>   src/tests/master_authorization_tests.cpp 652e80d0d4567b225c6ffb326725ddfde06f7fd3 
>   src/tests/master_tests.cpp 2e525749247626c05efb2f54a707599facb114b6 
>   src/tests/resource_offers_tests.cpp fe66432b1bf75ee25feb73c4bb353e4d4e5b503f 
> 
> Diff: https://reviews.apache.org/r/26382/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Dominic Hamon
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message