mesos-issues mailing list archives

From "Benjamin Mahler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-8129) Very large resource value crashes master
Date Mon, 13 Nov 2017 21:41:00 GMT

    [ https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250290#comment-16250290
] 

Benjamin Mahler commented on MESOS-8129:
----------------------------------------

Thanks, Bruce, for the well-written ticket. I think I could compute the limit empirically by
repeatedly adding 0.001 until the result skips a step of 0.001. Determining it more formally
would take me some time, since the fractional component complicates matters compared with
integers stored in doubles (where one could just use Number.MAX_SAFE_INTEGER).
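
As a rough illustration of the limit in question, here is a minimal Python sketch. It is not the
empirical add-0.001 loop described above; instead it looks at the spacing between adjacent
doubles (math.ulp, Python 3.9+) and finds the first power of two at which that spacing exceeds
0.001. The names STEP, spacing_exceeds_step and first_power_of_two_too_coarse are hypothetical,
and the result is only an upper bound on magnitude, not a verified safe limit for Mesos.

```python
import math

STEP = 0.001  # resolution mentioned in the ticket (assumed to be the relevant step)


def spacing_exceeds_step(x: float, step: float = STEP) -> bool:
    """Return True if adjacent doubles near x are further apart than step."""
    return math.ulp(x) > step


def first_power_of_two_too_coarse(step: float = STEP) -> float:
    """Double x until the gap between neighbouring doubles is larger than step."""
    x = 1.0
    while not spacing_exceeds_step(x, step):
        x *= 2.0
    return x


threshold = first_power_of_two_too_coarse()
print(threshold)                       # magnitude at which steps of 0.001 can be skipped
print(math.ulp(4294967295000000.0))    # spacing at the capacity from the ticket
print(spacing_exceeds_step(1e12))      # the 10^12 bound suggested in the ticket
```

A bound derived this way only says where a double can no longer distinguish steps of 0.001;
whether arithmetic stays exact below it would still need to be checked against the actual
implementation.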

Also, I'm curious what your use case is, can you tell me?

> Very large resource value crashes master
> ----------------------------------------
>
>                 Key: MESOS-8129
>                 URL: https://issues.apache.org/jira/browse/MESOS-8129
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>            Reporter: Bruce Merry
>            Assignee: Benjamin Mahler
>            Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of units had
led to an agent with a custom scalar resource of capacity 4294967295000000. I believe what
is happening is that the pseudo-fixed-point arithmetic isn't able to cope with such large numbers,
because rounding errors after arithmetic are bigger than 0.001. Examining the values in the
debugger showed that the CHECK failed due to a rounding error on the order of 0.2.
> While this is probably a fundamental limitation of the fixed-point implementation and
such large resource values are probably a bad idea, it would have helped if the agent had
complained on startup, rather than having to debug an internal assertion failure. I'd suggest
that values larger than, say, 10^12 should be rejected when the agent starts (which is why
I've added the agent component), although someone familiar with the details of the fixed-point
implementation should probably verify that number.
> I'm not sure where this needs to be fixed, e.g. whether it can just be validated on agent startup
or whether it should be baked into the Resource class to prevent accidents in requests from the
user.
> To reproduce the issue, start a master and an agent with a custom scalar resource "thing:4294967295000000",
then use mesos-execute to throw the following task at it (it'll probably also work with a
smaller Docker image - that's just one I already had on the agent). When the sleep ends, the
master crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     }, 
>     "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
>     "value": "00000001"
>   }, 
>   "command": {
>     "shell": false, 
>     "value": "sleep", 
>     "arguments": [
>       "10"
>     ]
>   }, 
>   "agent_id": {
>     "value": ""
>   }, 
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       }, 
>       "type": "SCALAR", 
>       "name": "cpus"
>     }, 
>     {
>       "scalar": {
>         "value": 4106.0
>       }, 
>       "type": "SCALAR", 
>       "name": "mem"
>     }, 
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       }, 
>       "type": "SCALAR", 
>       "name": "thing"
>     }
>   ]
> }
> {code}
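
To make the rounding problem from the ticket concrete, the following Python sketch models
pseudo-fixed-point arithmetic under a loud assumption: values are scaled by 1000, rounded to an
integer, combined, and then stored back in a double. The helpers to_fixed, to_float and subtract
are hypothetical names for illustration and are not taken from the Mesos source; the sketch does
not reproduce the failing CHECK itself, only the loss of precision at this magnitude.

```python
from fractions import Fraction

SCALE = 1000  # assumed: three decimal places, matching the 0.001 in the ticket


def to_fixed(value: float) -> int:
    """Scale to thousandths and round to the nearest integer."""
    return round(value * SCALE)


def to_float(fixed: int) -> float:
    """Convert thousandths back to a double (this is where precision is lost)."""
    return fixed / SCALE


def subtract(left: float, right: float) -> float:
    """Subtract in integer thousandths, then store the result as a double."""
    return to_float(to_fixed(left) - to_fixed(right))


total = 4294967295000000.0  # agent capacity of "thing" from the ticket
task = 12465430.06012024    # "thing" requested by the task

remaining = subtract(total, task)

# Compare the stored double with the exact fixed-point result.
exact = Fraction(to_fixed(total) - to_fixed(task), SCALE)
error = abs(Fraction(remaining) - exact)

print(remaining)
print(float(error))  # larger than 0.001 at this magnitude
```

Under this model the integer arithmetic is exact, and the error appears only when the result
is converted back to a double, whose neighbouring values are 0.5 apart near 4.29e15.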



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
