From: Tomas Barton
Date: Fri, 26 Sep 2014 15:15:04 +0200
Subject: Re: Problems with OOM
To: user
In-Reply-To: <54254C16.5010206@blue-yonder.com>

Just to make sure, all slaves are running with:

--isolation='cgroups/cpu,cgroups/mem'

Is there anything suspicious in the Mesos slave logs?
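
One way to run that check on each slave host is sketched below in Python. It is only an illustration, not part of Mesos: the helper names `isolators_from_argv` and `read_argv` are made up here, and it only looks at the command line of the slave process, so it would miss a flag that is set some other way (for example through the environment).

```python
def isolators_from_argv(argv):
    """Return the set of isolators named by --isolation=... in argv.

    Returns an empty set when the flag is not on the command line.
    """
    value = None
    for arg in argv:
        if arg.startswith("--isolation="):
            value = arg.split("=", 1)[1]
    if value is None:
        return set()
    # Drop shell quotes that may have survived, then split the comma list.
    value = value.strip("'\"")
    return {name.strip() for name in value.split(",") if name.strip()}


def read_argv(pid):
    """Read the argument list of a running process on Linux.

    /proc/<pid>/cmdline holds the arguments separated by NUL bytes.
    """
    with open("/proc/%d/cmdline" % pid, "rb") as f:
        return [a.decode() for a in f.read().split(b"\0") if a]


# Hypothetical usage, with the PID of the slave process looked up by hand:
#   expected = {"cgroups/cpu", "cgroups/mem"}
#   missing = expected - isolators_from_argv(read_argv(slave_pid))
```

An empty `missing` set would mean both isolators are named on the command line of that slave.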

On 26 September 2014 13:20, Stephan Erb <stephan.erb@blue-yonder.com> wrote:
Hi everyone,

I am having issues with the cgroups isolation of Mesos. It seems like tasks are prevented from allocating more memory than their limit. However, they are never killed.
  • My scheduled task allocates memory in a tight loop. According to 'ps', once its memory requirements are exceeded it is not killed, but ends up in the state D ("uninterruptible sleep (usually IO)").
  • The task is still considered running by Mesos.
  • There is no indication of an OOM in dmesg.
  • There is neither an OOM notice nor any other output related to the task in the slave log.
  • According to htop, the system load is increased, with a significant portion of CPU time spent within the kernel. Commonly the load is so high that all ZooKeeper connections time out.
I am running Aurora and Mesos 0.20.1 using the cgroups isolation on Debian 7 (kernel 3.2.60-1+deb7u3).

Sorry for the somewhat unspecific error description. Still, does anyone have an idea what might be wrong here?

Thanks and Best Regards,
Stephan
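
The symptoms in the report above (task in state D, no OOM message in dmesg) can be cross-checked against what the kernel exposes for the task and its memory cgroup. The Python sketch below is a rough aid under stated assumptions, not a diagnosis tool: it assumes a cgroup v1 memory hierarchy, the function names are made up for illustration, and the cgroup directory of the task has to be located by hand (the path in the usage comment is a placeholder). It parses the text of `/proc/<pid>/stat` and of the cgroup's `memory.oom_control` file, which reports `oom_kill_disable` and `under_oom`.

```python
def process_state(stat_text):
    """Extract the one-letter process state from /proc/<pid>/stat text.

    The format is "pid (comm) state ...". The command name may itself
    contain spaces or parentheses, so split after the last ')'.
    """
    return stat_text[stat_text.rindex(")") + 1:].split()[0]


def parse_oom_control(text):
    """Parse memory.oom_control lines such as "under_oom 1" into a dict."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            result[parts[0]] = int(parts[1])
    return result


def summarize(stat_text, oom_control_text):
    """Combine both views into one small report."""
    oom = parse_oom_control(oom_control_text)
    return {
        "state": process_state(stat_text),
        "under_oom": oom.get("under_oom") == 1,
        "oom_kill_disable": oom.get("oom_kill_disable") == 1,
    }


# Hypothetical usage; the cgroup path below is a placeholder:
#   stat = open("/proc/%d/stat" % task_pid).read()
#   ctl = open("/path/to/memory/cgroup/of/task/memory.oom_control").read()
#   print(summarize(stat, ctl))
```

A task in state D whose cgroup reports `under_oom` as 1 would suggest the cgroup has hit its limit while its tasks are blocked rather than killed. The sketch only reports these values and does not explain why no kill happens.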
