From: Elias Levy
Date: Wed, 1 Aug 2018 09:49:21 -0700
Subject: Re: Old job resurrected during HA failover
To: yanghua1127@gmail.com
Cc: user

Vino,

Thanks for the reply. Looking in ZK I see:

[zk: localhost:2181(CONNECTED) 5] ls /flink/cluster_1/jobgraphs
[d77948df92813a68ea6dfd6783f40e7e, 2a4eff355aef849c5ca37dbac04f2ff1]

Again we see HA state for job 2a4eff355aef849c5ca37dbac04f2ff1, even though that job is no longer running (it was canceled while it was in a loop attempting to restart, but failing because of a lack of cluster slots).

Any idea why that may be the case?

On Wed, Aug 1, 2018 at 8:38 AM vino yang <yanghua1127@gmail.com> wrote:

> If a job is explicitly canceled, its jobgraph node on ZK will be deleted.
> However, it is worth noting here that Flink enables a background thread to
> asynchronously delete the jobGraph node,
> so there may be cases where it cannot be deleted.
> On the other hand, the jobgraph node on ZK is the only basis for the JM
> leader to restore the job.
> There may be an unexpected recovery or an old job resurrection.
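The check described above (comparing the children of the jobgraphs znode against the jobs that are actually running) can be sketched as a small set difference. This is only an illustration: the function name `find_stale_jobgraphs` is hypothetical, the znode names are copied from the zkCli output, and the list of active job IDs is an assumption, since the message only states that 2a4eff355aef849c5ca37dbac04f2ff1 is no longer running.

```python
def find_stale_jobgraphs(jobgraph_nodes, active_job_ids):
    """Return jobgraph znode names that have no matching active job.

    jobgraph_nodes: child names of the jobgraphs znode (one per job ID).
    active_job_ids: IDs of jobs the cluster currently reports as active.
    """
    active = {job_id.lower() for job_id in active_job_ids}
    return sorted(node for node in jobgraph_nodes if node.lower() not in active)


# Children of /flink/cluster_1/jobgraphs, copied from the zkCli output.
jobgraph_nodes = [
    "d77948df92813a68ea6dfd6783f40e7e",
    "2a4eff355aef849c5ca37dbac04f2ff1",
]

# Assumption: only the first job is still running; the message does not
# list the active jobs explicitly.
active_job_ids = ["d77948df92813a68ea6dfd6783f40e7e"]

stale = find_stale_jobgraphs(jobgraph_nodes, active_job_ids)
print(stale)
```

Any ID this reports is a candidate for the resurrection scenario Vino describes, because a new JM leader would recover whatever it finds under the jobgraphs node. The sketch only flags such nodes; it does not touch ZooKeeper.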