From: Piotr Nowojski
Subject: Re: Odd job failure
Date: Wed, 2 May 2018 13:37:50 +0200
To: Elias Levy <fearsome.lucidity@gmail.com>
Cc: user@flink.apache.org

Hi,

It might be some Kafka issue.
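
If the problem turns out to be on the client side, one thing worth comparing with your other (healthy) Kafka clients is the consumer timeout configuration, since those settings influence how quickly the coordinator is marked dead. Just as a rough sketch of where those knobs are passed in (0.10 connector assumed; the broker/topic/group names and the values are placeholders, not recommendations):

import java.util.Properties;

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaSourceSketch {
    // Sketch only: placeholder names and example values, adjust to your setup.
    public static FlinkKafkaConsumer010<String> buildConsumer() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker-1:9092");   // placeholder
        props.setProperty("group.id", "my-flink-job");              // placeholder
        // Settings that influence coordinator/heartbeat failure detection:
        props.setProperty("session.timeout.ms", "30000");
        props.setProperty("heartbeat.interval.ms", "10000");
        props.setProperty("request.timeout.ms", "40000");  // keep it above session.timeout.ms
        return new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), props);
    }
}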

From what you described, your reasoning seems sound. For some reason TM3 fails and is unable to restart and process any data, thus forcing TM1 and TM2 to spill data to disk while aligning on checkpoint barriers.
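
If I understood your setup correctly, it is something like the following (a simplified sketch with made-up topic and job names, not your actual code), and the alignment buffering happens in the connected two-input operator while it waits for the barrier from the control stream:

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class TwoInputSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "broker-1:9092");  // placeholder
        kafkaProps.setProperty("group.id", "two-input-sketch");        // placeholder

        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer010<>("events-topic", new SimpleStringSchema(), kafkaProps));
        DataStream<String> control = env.addSource(
                new FlinkKafkaConsumer010<>("control-topic", new SimpleStringSchema(), kafkaProps));

        events
            .connect(control)
            .flatMap(new CoFlatMapFunction<String, String, String>() {
                private boolean enabled = true;

                @Override
                public void flatMap1(String event, Collector<String> out) {
                    // Primary input: during checkpoint alignment this is the side
                    // that gets buffered (and spilled to disk) while the operator
                    // waits for the barrier from the control input.
                    if (enabled) {
                        out.collect(event);
                    }
                }

                @Override
                public void flatMap2(String command, Collector<String> out) {
                    // Control input: if its barrier never arrives (e.g. its source
                    // task is stuck on TM3), the alignment never completes.
                    enabled = !"disable".equals(command);
                }
            })
            .print();

        env.execute("two-input sketch");
    }
}

If that roughly matches your job, then everything points at the control-stream source task never making progress during those restart attempts.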

I don't know the reason behind the java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1 errors, but it doesn't seem to be important in this case.

1. What Kafka version are you using? Have you looked for any known Kafka issues with those symptoms?
2. Maybe the easiest thing would be to reformat/reinstall/recreate the TM3 AWS image? It might be some system issue.
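
Regarding your question at the end about a limit on spilling: if I remember correctly (please double-check against the 1.4 documentation), there is a setting that caps how many bytes a checkpoint alignment may buffer; when the cap is exceeded the checkpoint is aborted instead of buffering further. Roughly, in flink-conf.yaml:

# Maximum bytes buffered during checkpoint alignment; the default of -1 means no limit.
# The value below (1 GB) is only an example.
task.checkpoint.alignment.max-size: 1073741824

It does not apply back-pressure, it just skips the checkpoint once the limit is hit, but it should keep the alignment from filling up the disks.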

Piotrek

On 28 Apr 2018, at 01:54, Elias Levy <fearsome.lucidity@gmail.com> wrote:

A job on our Flink 1.4.2 cluster with three TMs experienced an odd failure the other day. It seems to have started as some sort of network event.

It began with the 3rd TM starting to warn every 30 seconds about socket timeouts while sending metrics to DataDog. This lasted for the whole outage.

Twelve minutes later, all TMs reported at nearly the same time that they had marked the Kafka coordinator as dead ("Marking the coordinator XXX (id: 2147482640 rack: null) dead for group ZZZ"). The job terminated and the system attempted to recover it. Then things got into a weird state.

The following repeated six or seven times over a period of about 40 minutes:
  1. The TMs attempt to restart the job, but only the first and second TMs show signs of doing so.
  2. The disk begins to fill up on TMs 1 and 2.  
  3. TMs 1 & 2 both report java.lang.NoClassDefFoundError: org/apache/kafka/clients/NetworkClient$1 errors. These were mentioned on this list earlier this month. It is unclear if they are benign.
  4. The job dies when the disks finally fill up on TMs 1 and 2.

Looking at the backtrace logged when the disk fills up, I gather that Flink is buffering data coming from Kafka into one of my operators as a result of a barrier. The job has a two-input operator, with one input for the primary data and a secondary input for control commands. It would appear that, for whatever reason, the barrier for the control stream is not making it to the operator, thus leading to the buffering and full disks. Maybe Flink scheduled the operator that sources the control stream on the 3rd TM, which seems like it was not scheduling tasks?

Finally, the JM records that it received 13 late messages for already expired checkpoints (could they be from the 3rd TM?), the job is restored one more time, and it works. All TMs report at nearly the same time that they can now find the Kafka coordinator.

Seems like the 3rd TM had some connectivity issue, but then all TMs seem to have had a problem communicating with the Kafka coordinator at the same time, and they all recovered at the same time. The TMs are hosted in AWS across AZs, so all of them having connectivity issues at the same time is suspect. The Kafka node in question was up, and other clients in our infrastructure seemed able to communicate with it without trouble. Also, the Flink job itself seemed to be talking to the Kafka cluster while restarting, as it was spilling data coming from Kafka to disk. And the JM did not report any reduction in available task slots, which it would have if there were connectivity issues between the JM and the 3rd TM. Yet the logs on the 3rd TM do not show any record of it trying to restore the job during the intermediate attempts.

What do folks make of it?


And a question for the Flink devs: is there some reason why Flink does not stop spilling messages to disk when the disk is about to fill up? It seems like there should be a configurable limit on how much data can be spilled before back-pressure is applied to slow down or stop the source.
