Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Delivered-To: mailing list user@flink.apache.org
MIME-Version: 1.0
From: Flavio Pompermaier
Date: Wed, 24 May 2017 14:32:12 +0200
Subject: Re: Flink and swapping question
To: Greg Hogan
Cc: user
Content-Type:
multipart/alternative; boundary="001a11425af85c6f940550444c8d"
archived-at: Wed, 24 May 2017 12:32:39 -0000

--001a11425af85c6f940550444c8d
Content-Type: text/plain; charset="UTF-8"

Hi Greg,
I carefully monitored all TM memory with jstat -gcutil and there's no full GC, only .
The initial situation on the dying TM is:

  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00 100.00  33.57  88.74  98.42  97.17    159    2.508     1    0.255    2.763
  0.00 100.00  90.14  88.80  98.67  97.17    197    2.617     1    0.255    2.873
  0.00 100.00  27.00  88.82  98.75  97.17    234    2.730     1    0.255    2.986

After about 10 hours of processing it is:

  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267
  0.00 100.00  21.74  83.66  98.52  96.94   5519   33.011     1    0.255   33.267

So I don't think that OOM could be an option.

However, the cluster is running on ESXi vSphere VMs and we have already
experienced unexpected job crashes caused by ESXi moving a heavily loaded VM
to another (less loaded) physical machine. I wouldn't be surprised if swapping
is also handled somewhat differently.
Looking at the Cloudera widgets I see that the crash is usually preceded by an
intense cpu_iowait period.
I fear that Flink's unsafe memory access could be a problem in those
scenarios. Am I wrong?

Any insight or debugging technique is greatly appreciated.
Best,
Flavio

On Wed, May 24, 2017 at 2:11 PM, Greg Hogan wrote:

> Hi Flavio,
>
> Flink handles interrupts so the only silent killer I am aware of is
> Linux's OOM killer. Are you seeing such a message in dmesg?
>
> Greg
>
> On Wed, May 24, 2017 at 3:18 AM, Flavio Pompermaier
> wrote:
>
>> Hi to all,
>> I'd like to know whether memory swapping could cause a taskmanager crash.
>> In my cluster of virtual machines I'm seeing this strange behavior in my
>> Flink cluster: sometimes, if memory gets swapped the taskmanager (on that
>> machine) dies unexpectedly without any log about the error.
>>
>> Is that possible or not?
>>
>> Best,
>> Flavio
>>
>
>
--001a11425af85c6f940550444c8d
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--001a11425af85c6f940550444c8d--
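
The jstat -gcutil samples in the message can also be compared mechanically. The following is a minimal Python sketch under stated assumptions: `parse_row` and `gc_delta` are hypothetical helper names, and the column order is taken from the header row quoted in the message. It only computes how the collection counters changed between the first and the last sample; it does not diagnose the crash itself.

```python
# Column names as printed in the jstat -gcutil header row of the message.
COLUMNS = ["S0", "S1", "E", "O", "M", "CCS", "YGC", "YGCT", "FGC", "FGCT", "GCT"]


def parse_row(line):
    """Turn one whitespace-separated jstat -gcutil row into a dict of floats."""
    values = [float(v) for v in line.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


def gc_delta(first, last):
    """Hypothetical helper: change in GC counters between two samples."""
    return {
        "young_collections": int(last["YGC"] - first["YGC"]),
        "full_collections": int(last["FGC"] - first["FGC"]),
        "gc_seconds": round(last["GCT"] - first["GCT"], 3),
        "old_gen_change_pct": round(last["O"] - first["O"], 2),
    }


# First and last samples copied from the message.
start = parse_row("0.00 100.00 33.57 88.74 98.42 97.17 159 2.508 1 0.255 2.763")
end = parse_row("0.00 100.00 21.74 83.66 98.52 96.94 5519 33.011 1 0.255 33.267")
delta = gc_delta(start, end)
print(delta)
```

A `full_collections` value of zero together with a non-growing old generation matches the reading in the message that heap exhaustion inside the JVM is unlikely. It says nothing about memory pressure outside the JVM heap, which is what the Linux OOM killer reacts to.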