Date: Mon, 13 Nov 2017 21:41:00 +0000 (UTC)
From: "Benjamin Mahler (JIRA)"
To: issues@mesos.apache.org
Reply-To: dev@mesos.apache.org
Subject: [jira] [Commented] (MESOS-8129) Very large resource value crashes master

[ https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250290#comment-16250290 ]

Benjamin Mahler commented on MESOS-8129:
----------------------------------------

Thanks Bruce for the well-written ticket. I think I could compute the safe limit empirically by adding 0.001 until I see the result skip a 0.001 increment. Determining it more formally would take me some time, since the fractional component complicates matters compared to plain integers in doubles (for integers one could just use Number.MAX_SAFE_INTEGER). Also, I'm curious what your use case is; can you tell me?
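(Editor's note: the following is a minimal sketch of the empirical probe described above, not code from the ticket or from Mesos. It assumes C++ with IEEE-754 doubles, and instead of literally adding 0.001 from zero - which would take on the order of 10^16 iterations - it walks power-of-two magnitudes and reports the first one at which a 0.001 increment no longer survives rounding.)

{code:cpp}
#include <cmath>
#include <cstdio>

int main() {
  // Walk up through power-of-two magnitudes and report the first one at
  // which adding 0.001 to a double lands more than 0.0005 away from the
  // mathematically expected result, i.e. where 0.001 increments "skip".
  for (int e = 30; e < 60; e++) {
    double x = std::ldexp(1.0, e);      // x = 2^e, exactly representable
    double observed = (x + 0.001) - x;  // the increment after rounding
    if (std::fabs(observed - 0.001) > 0.0005) {
      std::printf("0.001 increments degrade at 2^%d ~= %.3g\n", e, x);
      break;
    }
  }
  return 0;
}
{code}

On a typical IEEE-754 platform this reports 2^43, roughly 8.8e12, which lines up with the ~10^12 cutoff suggested in the ticket below.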
> Very large resource value crashes master
> ----------------------------------------
>
>                 Key: MESOS-8129
>                 URL: https://issues.apache.org/jira/browse/MESOS-8129
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04
>                      Both apt packages from Mesosphere repo and Docker images
>            Reporter: Bruce Merry
>            Assignee: Benjamin Mahler
>            Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
>
> I found that a combination of a misconfiguration and a suboptimal choice of units had led to an agent with a custom scalar resource of capacity 4294967295000000. I believe what is happening is that the pseudo-fixed-point arithmetic isn't able to cope with such large numbers, because rounding errors after arithmetic are bigger than 0.001. Examining the values in the debugger showed that the CHECK failed due to a rounding error on the order of 0.2.
>
> While this is probably a fundamental limitation of the fixed-point implementation, and such large resource values are probably a bad idea, it would have helped if the agent had complained on startup, rather than leaving me to debug an internal assertion failure. I'd suggest that values larger than, say, 10^12 should be rejected when the agent starts (which is why I've added the agent component), although someone familiar with the details of the fixed-point implementation should probably verify that number.
>
> I'm not sure where this needs to be fixed, e.g. whether it can just be validated on agent startup or whether it should be baked into the Resource class to prevent accidents in requests from the user.
>
> To reproduce the issue, start a master and an agent with a custom scalar resource "thing:4294967295000000", then use mesos-execute to throw the following task at it (it'll probably also work with a smaller Docker image - that's just one I already had on the agent). When the sleep ends, the master crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     },
>     "type": "DOCKER"
>   },
>   "name": "test-task",
>   "task_id": {
>     "value": "00000001"
>   },
>   "command": {
>     "shell": false,
>     "value": "sleep",
>     "arguments": [
>       "10"
>     ]
>   },
>   "agent_id": {
>     "value": ""
>   },
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       },
>       "type": "SCALAR",
>       "name": "cpus"
>     },
>     {
>       "scalar": {
>         "value": 4106.0
>       },
>       "type": "SCALAR",
>       "name": "mem"
>     },
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       },
>       "type": "SCALAR",
>       "name": "thing"
>     }
>   ]
> }
> {code}
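(Editor's note: a short companion sketch, again assuming C++ with IEEE-754 doubles and not taken from the ticket, showing why the reported capacity cannot support 0.001 granularity: at 4294967295000000, roughly 2^52, adjacent doubles are 0.5 apart, so a 0.001 increment is absorbed entirely by rounding, and errors far larger than the 0.001 tolerance - such as the ~0.2 observed in the debugger - are to be expected.)

{code:cpp}
#include <cmath>
#include <cstdio>

int main() {
  // The capacity from the ticket: 4294967295 * 10^6, just under 2^52.
  double capacity = 4294967295000000.0;

  // Spacing between adjacent representable doubles at this magnitude.
  double ulp = std::nextafter(capacity, INFINITY) - capacity;
  std::printf("spacing at capacity = %g\n", ulp);  // prints 0.5

  // 0.001 is less than half the spacing, so adding it changes nothing.
  std::printf("capacity + 0.001 == capacity: %s\n",
              capacity + 0.001 == capacity ? "true" : "false");  // true
  return 0;
}
{code}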