From: Fabian Hueske
Date: Mon, 7 May 2018 14:38:25 +0200
Subject: Re: strange behavior with jobmanager.rpc.address on standalone HA cluster
To: Derek VerLee
Cc: user, Till Rohrmann

Hi Derek,

1. I've created a JIRA issue to improve the docs as you recommended [1].

2. This discussion goes quite a bit into the internals of the HA setup. Let me pull in Till (in CC), who knows the details of HA.

Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9309

2018-05-05 15:34 GMT+02:00 Derek VerLee:

> Two things:
>
> 1. It would be beneficial, I think, to drop a line somewhere in the docs
> (probably on the production readiness checklist as well as the HA page)
> explaining that enabling ZooKeeper high availability allows your jobs to
> restart automatically after a jobmanager crash or restart. We had spent
> some cycles trying to implement job restarting and watchdogs (poorly) before
> I discovered this from a Flink Forward presentation on YouTube.
>
> 2. I seem to have found some odd behavior with HA, and then found something
> that works, but I can't explain why. The CliffsNotes version is that I took
> an existing standalone cluster with a single JM and switched it to
> ZooKeeper high-availability mode. The same flink-conf.yaml file is used on
> all nodes (including the JM). This seemed to work fine: I restarted the JM
> (jm0) and the jobs relaunched when it came back. Easy! Then I deployed a
> second JM (jm1). I modified `masters`, set the HA RPC port range, and opened
> those ports on the firewall for both jobmanagers, but left
> `jobmanager.rpc.address` at its original value, `jm0`, on all nodes. I then
> observed that jm0 worked fine: taskmanagers connected to it and jobs ran.
> jm1 did not 301 me to jm0, however; it displayed a dashboard (no jobs, no
> TMs). When I stopped jm0, the jobs showed up on jm1 as RESTARTING, but the
> taskmanagers never attached to jm1. In the logs, all nodes, including jm1,
> had messages about trying to reach jm0. From the documentation and various
> comments I've seen, `jobmanager.rpc.address` should be ignored in HA mode.
> However, commenting it out entirely led to the jobmanagers crashing at boot,
> and setting it to `localhost` caused all the taskmanagers to log messages
> about trying to connect to the jobmanager at localhost. What finally worked
> was to set the value, in each node's own flink-conf.yaml, to that node's
> own hostname, even on the taskmanagers.
>
> Does this seem like a bug?
>
> Just a hunch, but is there something called an "akka leader" that is
> different from the jobmanager leader, and could it be somehow defaulting
> its value over to jobmanager.rpc.address?
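[Editor's note: for context, below is a minimal sketch of the configuration layout described in point 2 above. This is an editorial illustration, not the original poster's files: the hostnames (jm0, jm1, zk0-zk2), ports, port range, and storage directory are placeholders, and the keys assume the standard standalone-ZooKeeper-HA configuration of Flink releases from that era.]

    # conf/flink-conf.yaml (shared layout; hostnames, ports, and paths are placeholders)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk0:2181,zk1:2181,zk2:2181
    high-availability.storageDir: hdfs:///flink/ha/
    # HA RPC port range opened on the firewall for both jobmanagers
    high-availability.jobmanager.port: 50000-50025
    # per the workaround described above, each node sets this to its OWN hostname
    # (jm0 on jm0, jm1 on jm1, and each taskmanager's hostname on that taskmanager)
    jobmanager.rpc.address: <this-node's-own-hostname>

    # conf/masters (one jobmanager per line, host:web-ui-port)
    jm0:8081
    jm1:8081

[In that era the jobmanager's actor address took the form akka.tcp://flink@<host>:<port>/user/jobmanager, so a wrong or missing hostname could plausibly show up as the connection attempts to jm0 or localhost described above; whether that is what actually happened here is the internals question deferred to Till.]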