From: Olga Luganska
To: "user@flink.apache.org"
Subject: Standalone HA cluster: Fatal error occurred in the cluster entrypoint.
Date: Thu, 15 Nov 2018 22:42:31 +0000

Hello,

I am running a Flink 1.6.1 standalone HA cluster. Today I am unable to start the cluster because of "Fatal error in cluster entrypoint".
(I used to see this error when running Flink 1.5; after upgrading to 1.6.1, which had a fix for this bug, everything worked well for a while.)

Question: what exactly needs to be done to clean the "state handle store"?

2018-11-15 15:09:53,181 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor - Fencing token not set: Ignoring message LocalFencedMessage(null, org.apache.flink.runtime.rpc.messages.RunAsync@21fd224c) because the fencing token is null.
2018-11-15 15:09:53,182 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:61)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under /e13034f83a80072204facb2cec9ea6a3. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:208)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJob(Dispatcher.java:692)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobGraphs(Dispatcher.java:677)
        at org.apache.flink.runtime.dispatcher.Dispatcher.recoverJobs(Dispatcher.java:658)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$26(Dispatcher.java:817)
        at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedFunction$1(FunctionUtils.java:59)
        ... 9 more
Caused by: java.io.FileNotFoundException: /checkpoint_repo/ha/submittedJobGraphdd865937d674 (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
        at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
        at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
        at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:64)
        at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:57)
        at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraph(ZooKeeperSubmittedJobGraphStore.java:202)
        ... 14 more
2018-11-15 15:09:53,185 INFO  org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache

Thank you,
Olga