mesos-issues mailing list archives

From "Vinod Kone (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (MESOS-8129) Very large resource value crashes master
Date Mon, 13 Nov 2017 20:14:00 GMT

     [ https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone reassigned MESOS-8129:
---------------------------------

    Assignee: Benjamin Mahler

> Very large resource value crashes master
> ----------------------------------------
>
>                 Key: MESOS-8129
>                 URL: https://issues.apache.org/jira/browse/MESOS-8129
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>            Reporter: Bruce Merry
>            Assignee: Benjamin Mahler
>            Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of units had
led to an agent with a custom scalar resource of capacity 4294967295000000. I believe what
is happening is that the pseudo-fixed-point arithmetic isn't able to cope with such large numbers,
because rounding errors after arithmetic are bigger than 0.001. Examining the values in the
debugger showed that the CHECK failed due to a rounding error on the order of 0.2.
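
The rounding problem described above can be sketched in a few lines of Python. This is only a model: it assumes the scalar is held as an IEEE 754 double and converted to and from an integer count of thousandths around each operation. The helper names to_fixed and to_float are illustrative and are not taken from the Mesos source. Near 4294967295000000 adjacent doubles are 0.5 apart, so a resolution of 0.001 cannot survive being stored back into a double.

```python
# Illustrative model only: to_fixed/to_float are hypothetical helpers that
# assume a double scaled by 1000 and rounded to the nearest integer.

def to_fixed(value: float) -> int:
    """Convert a double to an integer number of thousandths."""
    return round(value * 1000.0)


def to_float(fixed: int) -> float:
    """Convert an integer number of thousandths back to a double."""
    return float(fixed) / 1000.0


total = 4294967295000000.0   # agent capacity from the report
task = 12465430.06012024     # amount of "thing" requested by the task

# Doubles this large are spaced 0.5 apart, so adding 0.2 changes nothing.
lost_small_increment = (total + 0.2 == total)

# Subtract in fixed point, store the result as a double, then read it back.
remaining_fixed = to_fixed(total) - to_fixed(task)
drift = to_fixed(to_float(remaining_fixed)) - remaining_fixed

print(lost_small_increment)  # did the 0.2 increment vanish?
print(abs(drift) > 1)        # did the round trip move by more than 0.001?
```

In this model the remainder no longer maps back to the same number of thousandths once it has passed through a double, which is the kind of discrepancy an exact containment or equality check would trip over.
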
> While this is probably a fundamental limitation of the fixed-point implementation and
such large resource values are probably a bad idea, it would have helped if the agent had
complained on startup, rather than having to debug an internal assertion failure. I'd suggest
that values larger than, say, 10^12 should be rejected when the agent starts (which is why
I've added the agent component), although someone familiar with the details of the fixed-point
implementation should probably verify that number.
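
A startup check along the lines suggested above might look like the following sketch. Everything here is hypothetical: MAX_SCALAR, parse_scalar and validate_scalar are names invented for illustration, the 10^12 limit is the unverified figure from this report, and the parser only handles a single "name:value" pair rather than the full resource syntax.

```python
from typing import Tuple

# Hypothetical limit: the 10^12 figure suggested in the report, not a value
# verified against the fixed-point implementation.
MAX_SCALAR = 10.0 ** 12


def parse_scalar(spec: str) -> Tuple[str, float]:
    """Split a single 'name:value' scalar resource spec into its parts."""
    name, _, raw = spec.partition(":")
    return name, float(raw)


def validate_scalar(spec: str, limit: float = MAX_SCALAR) -> Tuple[str, float]:
    """Return (name, value), or raise ValueError if the value is out of range."""
    name, value = parse_scalar(spec)
    if not 0.0 <= value <= limit:
        raise ValueError(
            f"resource '{name}' has value {value:g}, allowed range is 0 to {limit:g}"
        )
    return name, value
```

With such a check, a spec like "thing:4294967295000000" would be refused with a readable message when the agent starts, instead of surfacing later as an assertion failure in the master.
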
> I'm not sure where this needs to be fixed, e.g. whether it can just be validated on agent startup
or whether it should be baked into the Resource class to prevent accidents in requests from the
user.
> To reproduce the issue, start a master and an agent with a custom scalar resource "thing:4294967295000000",
then use mesos-execute to throw the following task at it (it'll probably also work with a
smaller Docker image - that's just one I already had on the agent). When the sleep ends, the
master crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     }, 
>     "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
>     "value": "00000001"
>   }, 
>   "command": {
>     "shell": false, 
>     "value": "sleep", 
>     "arguments": [
>       "10"
>     ]
>   }, 
>   "agent_id": {
>     "value": ""
>   }, 
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       }, 
>       "type": "SCALAR", 
>       "name": "cpus"
>     }, 
>     {
>       "scalar": {
>         "value": 4106.0
>       }, 
>       "type": "SCALAR", 
>       "name": "mem"
>     }, 
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       }, 
>       "type": "SCALAR", 
>       "name": "thing"
>     }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
