From: Piotr Nowojski
Subject: Re: Odd job failure
Date: Wed, 2 May 2018 13:37:50 +0200
To: Elias Levy <fearsome.lucidity@gmail.com>
Cc: user@flink.apache.org

Hi,

It might be some Kafka issue.
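
If the problem turns out to be on the client side, one thing worth comparing with your other (healthy) Kafka clients is the consumer timeout configuration, since those settings influence how quickly the coordinator is marked dead. Just as a rough sketch of where those knobs are passed in (0.10 connector assumed; the broker/topic/group names and the values are placeholders, not recommendations):

import java.util.Properties;

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaSourceSketch {
    // Sketch only: placeholder names and example values, adjust to your setup.
    public static FlinkKafkaConsumer010<String> buildConsumer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker-1:9092");   // placeholder
        props.setProperty("group.id", "my-flink-job");              // placeholder
        // Settings that influence coordinator/heartbeat failure detection:
        props.setProperty("session.timeout.ms", "30000");
        props.setProperty("heartbeat.interval.ms", "10000");
        props.setProperty("request.timeout.ms", "40000");  // keep it above session.timeout.ms
        return new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);
    }
}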

From what you described, your reasoning seems sound. For some reason TM3 fails and is unable to restart and process any data, thus forcing TM1 and TM2 to spill data to disk while aligning on checkpoint barriers.
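
If I understood your setup correctly, it is something like the following (a simplified sketch with made-up topic and job names, not your actual code), and the alignment buffering happens in the connected two-input operator while it waits for the barrier from the control stream:

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class TwoInputSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "broker-1:9092");  // placeholder
        kafkaProps.setProperty("group.id", "two-input-sketch");        // placeholder

        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer010<>("events-topic", new SimpleStringSchema(), kafkaProps));
        DataStream<String> control = env.addSource(
                new FlinkKafkaConsumer010<>("control-topic", new SimpleStringSchema(), kafkaProps));

        events
            .connect(control)
            .flatMap(new CoFlatMapFunction<String, String, String>() {
                private boolean enabled = true;

                @Override
                public void flatMap1(String event, Collector<String> out) {
                    // Primary input: during checkpoint alignment this is the side
                    // that gets buffered (and spilled to disk) while the operator
                    // waits for the barrier from the control input.
                    if (enabled) {
                        out.collect(event);
                    }
                }

                @Override
                public void flatMap2(String command, Collector<String> out) {
                    // Control input: if its barrier never arrives (e.g. its source
                    // task is stuck on TM3), the alignment never completes.
                    enabled = !"disable".equals(command);
                }
            })
            .print();

        env.execute("two-input sketch");
    }
}

If that roughly matches your job, then everything points at the control-stream source task never making progress during those restart attempts.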

I don't know the reason behind the java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1 errors, but it doesn't seem to be important in this case.

1. What Kafka version are you using? Have you looked for any known Kafka issues with those symptoms?
2. Maybe the easiest thing would be to reformat/reinstall/recreate the TM3 AWS image? It might be some system issue.
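
Regarding your question at the end about a limit on spilling: if I remember correctly (please double-check against the 1.4 documentation), there is a setting that caps how many bytes a checkpoint alignment may buffer; when the cap is exceeded the checkpoint is aborted instead of buffering further. Roughly, in flink-conf.yaml:

# Maximum bytes buffered during checkpoint alignment; the default of -1 means no limit.
# The value below (1 GB) is only an example.
task.checkpoint.alignment.max-size: 1073741824

It does not apply back-pressure, it just skips the checkpoint once the limit is hit, but it should keep the alignment from filling up the disks.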

Piotrek

On 28 Apr 2018, at 01:54, Elias Levy <fearsome.lucidity@gmail.com> wrote:

A job on our Flink 1.4.2 cluster with three TMs experienced an odd failure the other day. It seems to have started as some sort of network event.

It began with the 3rd TM starting to warn every 30 seconds about socket timeouts while sending metrics to DataDog. This lasted for the whole outage.

Twelve minutes later, all TMs reported at nearly the same time that they had marked the Kafka coordinator as dead ("Marking the coordinator XXX (id: 2147482640 rack: null) dead for group ZZZ"). The job terminated and the system attempted to recover it. Then things got into a weird state.

The following repeated six or seven times over a period of about 40 minutes:
  1. The TMs attempt to restart the job, but only the first and second TMs show signs of doing so.
  2. The disk begins to fill up on TMs 1 and 2.  
  3. TMs 1 & 2 both report java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1 errors. These were mentioned on this list earlier this month. It is unclear if they are benign.
  4. The job dies when the disks finally fill up on TMs 1 and 2.

Looking at the backtrace logged when the disk fills up, I gather that Flink is buffering data coming from Kafka into one of my operators as a result of a barrier. The job has a two-input operator, with one input for the primary data and a secondary input for control commands. It would appear that, for whatever reason, the barrier for the control stream is not making it to the operator, thus leading to the buffering and full disks. Maybe Flink scheduled the operator that sources the control stream on the 3rd TM, which seems like it was not scheduling tasks?

Finally, the JM records that it received 13 late messages for already expired checkpoints (could they be from the 3rd TM?), the job is restored one more time, and it works. All TMs report at nearly the same time that they can now find the Kafka coordinator.

Seems like the 3rd TM had some connectivity issue, but then all TMs seem to have had a problem communicating with the Kafka coordinator at the same time, and they all recovered at the same time. The TMs are hosted in AWS across AZs, so all of them having connectivity issues at the same time is suspect. The Kafka node in question was up, and other clients in our infrastructure seemed able to communicate with it without trouble. Also, the Flink job itself seemed to be talking to the Kafka cluster while restarting, as it was spilling data coming from Kafka to disk. And the JM did not report any reduction in available task slots, which it would have if there were connectivity issues between the JM and the 3rd TM. Yet the logs on the 3rd TM do not show any record of it trying to restore the job during the intermediate attempts.

What do folks make of it?


And a question for the Flink devs: is there some reason why Flink does not stop spilling messages to disk when the disk is about to fill up? It seems like there should be a configurable limit on how much data can be spilled before back-pressure is applied to slow down or stop the source.
