Subject: Re: No job recovery after job manager failure
From: Ufuk Celebi
Date: Thu, 17 Dec 2015 15:26:08 +0100
To: dev@flink.apache.org
Message-Id: <329670B9-E133-40DB-BB4D-0AFF939F9D5D@apache.org>
In-Reply-To: <826ED1E8-7AD1-4BFE-8E0C-E06E8CA47AF2@apache.org>
References: <826ED1E8-7AD1-4BFE-8E0C-E06E8CA47AF2@apache.org>

As an update: I’m investigating this. Ali sent me the log files.

> On 16 Dec 2015, at 18:15, Ufuk Celebi wrote:
>
> Hey Ali,
>
> can you send me the complete logs?
>
> I don’t think it’s possible via the mailing list. Just send it to my private email uce@apache.org.
>
> – Ufuk
>
>> On 16 Dec 2015, at 17:26, Kashmar, Ali wrote:
>>
>> Hi,
>>
>> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I started a job with parallelism = 32 and waited for a few seconds so that all nodes are doing work. I then shut down the node that had the leader job manager, and by shut down I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and I saw that zookeeper has elected a new leader. I also saw a log for recovering jobs, but nothing actually happens. Here’s the job manager log from the node that became the leader:
>>
>> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
>> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
>> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
>> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
>> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>>
>>
>> I waited 10 minutes after that last log and there was no change.
>> And here’s the task-manager log from the same node:
>>
>>
>> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
>> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
>> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
>> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
>> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
>> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>>
>>
>> Is this a bug?
>>
>> Thanks,
>> Ali
>