Date: Mon, 31 Aug 2015 15:54:46 -0700
From: Christopher Ketchum
To: dev@mesos.apache.org
Subject: CPU soft lock up on mesos-slave

Hi all,

I was running a Mesos cluster on EC2 with c4.8xlarge instances when one of the status checks failed. We are running Mesos 0.22.1 on Ubuntu 14.04 with kernel version 3.13.0-55-generic. EC2 gave us this console output [1].
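To pull the soft-lockup reports out of a console log like [1], here is a minimal Python sketch. It assumes the watchdog message format used by 3.x kernels ("BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]"). The function name and the sample lines are hypothetical, not taken from [1]. One caveat: the task named in brackets is whatever was running on the stuck CPU when the watchdog fired. It shows where the CPU was stuck, but that alone does not prove the task caused the lockup.

```python
import re
from collections import Counter

# Soft-lockup warning printed by the kernel watchdog on 3.x kernels:
#   "BUG: soft lockup - CPU#<cpu> stuck for <secs>s! [<comm>:<pid>]"
# re.search tolerates any prefix (console timestamps, "watchdog: " on newer
# kernels). The comm group allows colons, e.g. "kworker/7:1".
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(console_text):
    """Hypothetical helper: count soft-lockup reports per task name (comm)."""
    counts = Counter()
    for line in console_text.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match:
            counts[match.group("comm")] += 1
    return counts


# Illustrative lines only (made-up timestamps and PIDs, not from [1]).
sample = """\
[ 1000.000001] BUG: soft lockup - CPU#3 stuck for 22s! [mesos-slave:2345]
[ 1028.000002] BUG: soft lockup - CPU#3 stuck for 22s! [mesos-slave:2345]
[ 1030.000003] BUG: soft lockup - CPU#7 stuck for 23s! [kworker/7:1:89]
"""

print(summarize_soft_lockups(sample).most_common())
# -> [('mesos-slave', 2), ('kworker/7:1', 1)]
```

If every report names the same task, that is consistent with what these logs show. Reading the stack trace printed after each warning is still needed to see what the CPU was actually doing.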
I did some searching and found similar issues reported on LKML [2], though those logs pointed to a specific task and an older kernel, whereas these logs show only mesos-slave as the causative process. Unfortunately, the instance was terminated, so I'm not sure how much useful debugging can be done.

Is this a known issue? We are also using our own Python executor; could an error there have caused this?

[1] http://pastebin.com/NgHi8MnS
[2] https://lkml.org/lkml/2014/9/30/498

Thanks,
Chris