From: Fabian Hueske
Date: Mon, 7 May 2018 14:38:25 +0200
Subject: Re: strange behavior with jobmanager.rpc.address on standalone HA cluster
To: Derek VerLee
Cc: user, Till Rohrmann

Hi Derek,

1. I've created a JIRA issue to improve the docs as you recommended [1].

2. This discussion goes quite a bit into the internals of the HA setup. Let me pull in Till (in CC), who knows the details of HA.

Best,
Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9309

2018-05-05 15:34 GMT+02:00 Derek VerLee:

> Two things:
>
> 1. It would be beneficial, I think, to drop a line somewhere in the docs
> (probably on the production readiness checklist as well as the HA page)
> explaining that enabling ZooKeeper high availability allows your jobs to
> restart automatically after a jobmanager crash or restart. We had spent
> some cycles trying to implement job restarting and watchdogs (poorly) before
> I discovered this from a Flink Forward presentation on YouTube.
>
> 2. I seem to have found some odd behavior with HA, and then found something
> that works, but I can't explain why. The CliffsNotes version is that I took
> an existing standalone cluster with a single JM and switched it to
> ZooKeeper high-availability mode. The same flink-conf.yaml file is used on
> all nodes (including the JM). This seemed to work fine: I restarted the JM
> (jm0) and the jobs relaunched when it came back. Easy! Then I deployed a
> second JM (jm1). I modified `masters`, set the HA RPC port range, and opened
> those ports on the firewall for both jobmanagers, but left
> `jobmanager.rpc.address` at its original value, `jm0`, on all nodes. I then
> observed that jm0 worked fine: taskmanagers connected to it and jobs ran.
> jm1 did not 301 me to jm0, however; it displayed a dashboard (no jobs, no
> TMs). When I stopped jm0, the jobs showed up on jm1 as RESTARTING, but the
> taskmanagers never attached to jm1. In the logs, all nodes, including jm1,
> had messages about trying to reach jm0. From the documentation and various
> comments I've seen, `jobmanager.rpc.address` should be ignored in HA mode.
> However, commenting it out entirely led to the jobmanagers crashing at boot,
> and setting it to `localhost` caused all the taskmanagers to log messages
> about trying to connect to the jobmanager at localhost. What finally worked
> was to set the value, in each node's own flink-conf.yaml, to that node's
> own hostname, even on the taskmanagers.
>
> Does this seem like a bug?
>
> Just a hunch, but is there something called an "akka leader" that is
> different from the jobmanager leader, and could it be somehow defaulting
> its value over to jobmanager.rpc.address?
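[Editor's note: for context, below is a minimal sketch of the configuration layout described in point 2 above. This is an editorial illustration, not the original poster's files: the hostnames (jm0, jm1, zk0-zk2), ports, port range, and storage directory are placeholders, and the keys assume the standard standalone-ZooKeeper-HA configuration of Flink releases from that era.]

    # conf/flink-conf.yaml (shared layout; hostnames, ports, and paths are placeholders)
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk0:2181,zk1:2181,zk2:2181
    high-availability.storageDir: hdfs:///flink/ha/
    # HA RPC port range opened on the firewall for both jobmanagers
    high-availability.jobmanager.port: 50000-50025
    # per the workaround described above, each node sets this to its OWN hostname
    # (jm0 on jm0, jm1 on jm1, and each taskmanager's hostname on that taskmanager)
    jobmanager.rpc.address: <this-node's-own-hostname>

    # conf/masters (one jobmanager per line, host:web-ui-port)
    jm0:8081
    jm1:8081

[In that era the jobmanager's actor address took the form akka.tcp://flink@<host>:<port>/user/jobmanager, so a wrong or missing hostname could plausibly show up as the connection attempts to jm0 or localhost described above; whether that is what actually happened here is the internals question deferred to Till.]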