From: Andrew Ge Wu
Subject: Cluster failure after zookeeper glitch.
Date: Thu, 19 Jan 2017 13:16:57 +0100
To: user@flink.apache.org

Hi,

We recently had several ZooKeeper glitches, and each time it seems to take the Flink cluster down with it. We are running on 1.03.

It started like this:

2017-01-19 11:52:13,047 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0008, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,047 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0009, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,151 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-01-19 11:52:13,151 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2017-01-19 11:52:13,166 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
2017-01-19 11:52:13,169 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka://flink/user/jobmanager#1976923422 was revoked leadership.
2017-01-19 11:52:13,179 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - op1 -> (Map, Map -> op2) (18/24) (5336dd375eb12616c5a0e93c84f93465) switched from RUNNING to FAILED

Then our web UI stopped serving and the JobManager got stuck in an exception loop like this:

2017-01-19 13:05:13,521 WARN org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(0318ecf5-7069-41b2-a793-2f24bdbaa287,01/19/2017 13:05:13 Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(0318ecf5-7069-41b2-a793-2f24bdbaa287).
2017-01-19 13:05:13,521 INFO org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy - Delaying retry of job execution for xxxxx ms …

Is it because we misconfigured something, or is this expected behavior? When this happens we have to restart the cluster to bring it back.

Thanks!

Andrew

-- 
Confidentiality Notice: This e-mail transmission may contain confidential
or legally privileged information that is intended only for the individual
or entity named in the e-mail address. If you are not the intended
recipient, you are hereby notified that any disclosure, copying,
distribution, or reliance upon the contents of this e-mail is strictly
prohibited and may be unlawful. If you have received this e-mail in error,
please notify the sender immediately by return e-mail and delete all copies
of this message.