mesos-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yubo Li <liyub...@cn.ibm.com>
Subject Re: Review Request 50841: Added GPU scheduling logic to docker containerizer process.
Date Thu, 22 Sep 2016 06:15:54 GMT


> On 八月 28, 2016, 12:57 a.m., Kevin Klues wrote:
> > src/slave/containerizer/docker.hpp, line 506
> > <https://reviews.apache.org/r/50841/diff/7/?file=1486671#file1486671line506>
> >
> >     Shoud we use `set` here like we do in the mesos containerizer?

Yes, changed it to `set`.


> On 八月 28, 2016, 12:57 a.m., Kevin Klues wrote:
> > src/slave/containerizer/docker.cpp, lines 1341-1346
> > <https://reviews.apache.org/r/50841/diff/7/?file=1486672#file1486672line1341>
> >
> >     You can just use `if (gpus.get() > 0)` here.

the temp variable `requested` is removed.


> On 八月 28, 2016, 12:57 a.m., Kevin Klues wrote:
> > src/slave/containerizer/docker.cpp, lines 1541-1552
> > <https://reviews.apache.org/r/50841/diff/7/?file=1486672#file1486672line1541>
> >
> >     You can't rely on `containers_[containerId]` to be valid at this point (hence
the check for it below).
> >     
> >     Regardless, I just at the head of master to see where the most appropriate place
to do this deallocation would be, and this function looks *completely* different from what
I see here.
> >     
> >     When was the last time you rebased?

The code is rebased at Aug. 15, let's keep this issue. I'll come back to this after I fix
all other bugs and rebase by code.
Code rebased, where is the most appropriate place to do so?
BTW, `deallocateNvidiaGpus()` will check if `containers_[containerId]` is valid now, so we
don't need to worry that `containers_[containerId]` is invalid.


> On 八月 28, 2016, 12:57 a.m., Kevin Klues wrote:
> > src/slave/containerizer/docker.cpp, lines 2037-2041
> > <https://reviews.apache.org/r/50841/diff/7/?file=1486672#file1486672line2037>
> >
> >     It feels too early to be doing the `deallocate()` here. I think we should only
do it once the container is actually killed (i.e. in `___destroy()`).
> >     
> >     If we do it too early, the GPUs could still be being used by the container when
they get allocated out to someone else. This would not go very well.
> >     
> >     There is the case, however, in `__destroy()` where we aren't able to kill the
docker container at all! It's better to leak the GPUs in this case than to deallocate them
and run the risk of giving them to a new container. We should mention this in the error string
below if any GPUs are actually leaked:
> >     
> >     ```
> >         container->termination.fail(
> >             "Failed to kill the Docker container: " +
> >             (kill.isFailed() ? kill.failure() : "discarded future"));
> >     ```

I see. Also will come back to deal with deallocate issues after code rebased.
I moved this to `___destroy()`.
Also, added the error string for leaked GPUs:
```
    container->termination.fail(
        "Failed to kill the Docker container: " +
        (kill.isFailed() ? kill.failure() : "discarded future") +
        (container->gpus.empty() ?
            "" : ", " + stringify(container->gpus.size()) + "GPUs leaked"));
```


- Yubo


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50841/#review147089
-----------------------------------------------------------


On 九月 20, 2016, 9:22 a.m., Yubo Li wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50841/
> -----------------------------------------------------------
> 
> (Updated 九月 20, 2016, 9:22 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Guangya Liu, Kevin Klues, and Rajat Phull.
> 
> 
> Bugs: MESOS-5795
>     https://issues.apache.org/jira/browse/MESOS-5795
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added control logic to allocate/deallocate GPUs to GPU-related task
> when the task is started/terminated.
> 
> 
> Diffs
> -----
> 
>   src/slave/containerizer/docker.hpp f2a06065cf99fed934c2c1ffc47461ec8a97f50d 
>   src/slave/containerizer/docker.cpp 5c1ee8e467d1c54c60b67dc5275ef71e1bb90723 
> 
> Diff: https://reviews.apache.org/r/50841/diff/
> 
> 
> Testing
> -------
> 
> make check
> 
> 
> Thanks,
> 
> Yubo Li
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message