From user-return-22063-archive-asf-public=cust-asf.ponee.io@flink.apache.org Mon Aug 13 09:52:22 2018
Return-Path:
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
    by mx-eu-01.ponee.io (Postfix) with SMTP id D0EEE180629
    for ; Mon, 13 Aug 2018 09:52:21 +0200 (CEST)
Received: (qmail 70646 invoked by uid 500); 13 Aug 2018 07:52:17 -0000
Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Delivered-To: mailing list user@flink.apache.org
Received: (qmail 70636 invoked by uid 99); 13 Aug 2018 07:52:17 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142)
    by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 13 Aug 2018 07:52:17 +0000
Received: from localhost (localhost [127.0.0.1])
    by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 0FA68C98D0
    for ; Mon, 13 Aug 2018 07:52:17 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: 2.399
X-Spam-Level: **
X-Spam-Status: No, score=2.399 tagged_above=-999 required=6.31
    tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
    HTML_MESSAGE=2, KAM_NUMSUBJECT=0.5, RCVD_IN_DNSWL_NONE=-0.0001,
    SPF_PASS=-0.001] autolearn=disabled
Authentication-Results: spamd1-us-west.apache.org (amavisd-new);
    dkim=pass (1024-bit key) header.d=rovio.com
Received: from mx1-lw-eu.apache.org ([10.40.0.8])
    by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024)
    with ESMTP id MJYzpaivwGjI for ; Mon, 13 Aug 2018 07:52:15 +0000 (UTC)
Received: from mail-wm0-f46.google.com (mail-wm0-f46.google.com [74.125.82.46])
    by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 9A8EF5F42F
    for ; Mon, 13 Aug 2018 07:52:15 +0000 (UTC)
Received: by mail-wm0-f46.google.com with SMTP id c14-v6so7836921wmb.4
    for ; Mon, 13 Aug 2018 00:52:15 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=rovio.com; s=google;
    h=mime-version:references:in-reply-to:from:date:message-id:subject:to
    :cc;
    bh=w/wYGKtm9NHeHxr+XDMCRzKSYKdujII/B4XdXo+cjmw=;
    b=hrk4pY01iwhkmqT9WbpLdv0+kpleQ1436mzWMseeCg2U7H+/9/2XpLIuKc6LvbHD1V
    PwVoB6xMTboRglzwkA5OecL6sMgNmtZbBQUrpofztsvQ6Gs5Xr7R2Sv4k/JRJ3L0TLYu
    8e7YaQ5JaA4R7NzEptc6uaYHna9RYtI3Q9noU=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025;
    h=x-gm-message-state:mime-version:references:in-reply-to:from:date
    :message-id:subject:to:cc;
    bh=w/wYGKtm9NHeHxr+XDMCRzKSYKdujII/B4XdXo+cjmw=;
    b=D5dtsP0sqUYE0IMEE1LPqK4hdSpWoCeOU4tihpLRuhxvwxb+IYJInA+2LD/tLbTO3y
    eyXSQ7b4aVfI3+b+bxlJ+VmiEofmlHFunQGsEPcK1vpt22UG27xoFe+OlPjkTseqyQCK
    olLYh53bNvAbC7RVj2iwjbFSTaUC927HHDL4JXIhLW7H1dz+evEk16ypnPx4p6vPRos4
    vGsYvGo52qBxUjt8JfoJ2Tgm1QF6fjxGbu4U4K3mUbFFNy+RjuuUoZexdKVPsUCk9Kig
    YtQ/vgMQ0WUQqSKYHc8xLPoWcEldGJvX4MH3z+Qwy3qAsiKfo9DycBT8EAvcW4XZjcdq
    bsuw==
X-Gm-Message-State: AOUpUlHZXYYog6lAsJ/Ir2imYY7EXrgwgFLxpxBSAN7mHgaqSJ2i8ue/
    sgKmI+F3vCNghEksj1R59GxEx2XOljBpKaEmUY7QXg==
X-Google-Smtp-Source: AA+uWPzM046hefdoqikDThOD4ZHQSShJYRwUTLM6VYCK4/xtrJ+yyq2nFliTLQVNQcR7WGjyFXHa9vyRaEoAP8KYlYk=
X-Received: by 2002:a1c:497:: with SMTP id 145-v6mr7616896wme.157.1534146735178;
    Mon, 13 Aug 2018 00:52:15 -0700 (PDT)
MIME-Version: 1.0
References:
In-Reply-To:
From: Juho Autio
Date: Mon, 13 Aug 2018 10:52:04 +0300
Message-ID:
Subject: Re: 1.5.1
To: vishal.santoshi@gmail.com
Cc: Gary Yao , user
Content-Type: multipart/alternative; boundary="0000000000002f202905734c5fe1"

--0000000000002f202905734c5fe1
Content-Type: text/plain; charset="UTF-8"

I also have jobs failing on a daily basis with the error "Heartbeat of TaskManager with id <id> timed out". I'm using Flink 1.5.2.

Could anyone suggest how to debug possible causes?
I already set these in flink-conf.yaml, but I'm still getting failures:

heartbeat.interval: 10000
heartbeat.timeout: 100000

Thanks.

On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi wrote:

> According to the UI it seems that "
>
> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>
> " was the cause of a pipe restart.
>
> As to the TM it is an artifact of the new job allocation regime which will
> exhaust all slots on a TM rather than distributing them equitably. TMs
> selectively are under more stress than in a pure RR distribution I think.
> We may have to lower the slots on each TM to define a good upper bound. You
> are correct 50s is a pretty generous value.
>
> On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao wrote:
>
>> Hi,
>>
>> The first exception should be only logged on info level. It's expected to see
>> this exception when a TaskManager unregisters from the ResourceManager.
>>
>> Heartbeats can be configured via heartbeat.interval and heartbeat.timeout [1].
>> The default timeout is 50s, which should be a generous value. It is probably a
>> good idea to find out why the heartbeats cannot be answered by the TM.
>>
>> Best,
>> Gary
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager
>>
>>
>> On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:
>>
>>> 2 issues we are seeing on 1.5.1 on a streaming pipe line
>>>
>>> org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
>>>
>>> and
>>>
>>> java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.
>>>
>>> Not sure about the first but how do we increase the heartbeat interval
>>> of a TM
>>>
>>> Thanks much
>>>
>>> Vishal
>>>
>>
>

--0000000000002f202905734c5fe1
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
I also have jobs failing on a daily basis with the error "Heartbeat of TaskManager with id <id> timed out". I'm using Flink 1.5.2.

Could anyone suggest how to debug possible causes?

I already set these in flink-conf.yaml, but I'm still getting failures:
heartbeat.interval: 10000
heartbeat.timeout: 100000

Thanks.
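
For anyone checking similar settings, here is a minimal sketch of a sanity check for the two heartbeat keys discussed in this thread. It is not part of Flink: the helper names `read_heartbeat_settings` and `check_heartbeat` are hypothetical, the parser assumes flat `key: value` lines as in flink-conf.yaml, and the rule that the timeout should exceed the interval is a rule of thumb assumed here. The defaults are 10000 ms for the interval and 50000 ms for the timeout (the 50s mentioned in the quoted reply).

```python
# Hypothetical helper, not part of Flink: sanity-check heartbeat settings
# in a flink-conf.yaml-style text made of flat "key: value" lines.

DEFAULTS_MS = {
    "heartbeat.interval": 10000,  # default interval, in milliseconds
    "heartbeat.timeout": 50000,   # default timeout (the 50s from this thread)
}


def read_heartbeat_settings(conf_text):
    """Return the heartbeat.* values in ms, falling back to the defaults."""
    settings = dict(DEFAULTS_MS)
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key in settings:
            settings[key] = int(value)
    return settings


def check_heartbeat(settings):
    """Return a list of warnings; the list is empty if nothing looks off."""
    interval = settings["heartbeat.interval"]
    timeout = settings["heartbeat.timeout"]
    warnings = []
    if interval <= 0 or timeout <= 0:
        warnings.append("heartbeat values must be positive")
    if timeout <= interval:
        # Assumed rule of thumb: leave room for several missed heartbeats.
        warnings.append("heartbeat.timeout should be larger than heartbeat.interval")
    return warnings


conf = """
heartbeat.interval: 10000
heartbeat.timeout: 100000
"""
settings = read_heartbeat_settings(conf)
print(settings, check_heartbeat(settings))
```

With the values above the check reports no warnings: the interval is left at its default and only the timeout is doubled, so the remaining question is why the TaskManager cannot answer heartbeats in time.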

On Sun, Jul 22, 2018 at 2:20 PM Vishal Santoshi <vishal.santoshi@gmail.com> wrote:
According to the UI it seems that "
org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.
" was the cause of a pipe restart.

As to the TM it is an artifact of the new job allocation regime which will exhaust all slots on a TM rather than distributing them equitably. TMs selectively are under more stress than in a pure RR distribution I think. We may have to lower the slots on each TM to define a good upper bound. You are correct 50s is a pretty generous value.

On Sun, Jul 22, 2018 at 6:55 AM, Gary Yao <gary@data-artisans.com> wrote:
Hi,

The first exception should be only logged on info level. It's expected to see
this exception when a TaskManager unregisters from the ResourceManager.

Heartbeats can be configured via heartbeat.interval and heartbeat.timeout [1].
The default timeout is 50s, which should be a generous value. It is probably a
good idea to find out why the heartbeats cannot be answered by the TM.

Best,
Gary

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/config.html#heartbeat-manager


On Sun, Jul 22, 2018 at 1:36 AM, Vishal Santoshi <vishal.santoshi@gmail.com> wrote:
2 issues we are seeing on 1.5.1 on a streaming pipe line

org.apache.flink.util.FlinkException: The assigned slot 208af709ef7be2d2dfc028ba3bbf4600_10 was removed.

and
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id 208af709ef7be2d2dfc028ba3bbf4600 timed out.

Not sure about the first but how do we increase the heartbeat interval of a TM

Thanks much

Vishal



--0000000000002f202905734c5fe1--