Delivered-To: mailing list user@flink.apache.org
Subject: Re: Keyed CEP checkpoint fails
From: Daiqing Li
Date: Thu, 10 Aug 2017 20:40:41 -0400
To: Dawid Wysakowicz
Cc: FlinkUserGroup, Kostas Kloudas

Hi Dawid,

After rewriting hashCode with Objects.hash over all the fields, I still get the same error. One odd thing: after many attempts, the checkpoint always fails at 428. Does that mean anything?

> On Aug 10, 2017, at 9:14 AM, Dawid Wysakowicz wrote:
>
> Yes, with the information I have, the conclusion would be the same: I think the reason is a problem with the hashcode. Without some data to reproduce it, unfortunately I won’t be able to help you further. I can only advise you to debug the method SharedBuffer#serialize and pay attention to the entryID map.
>
>> On 10 Aug 2017, at 14:54, Daiqing Li wrote:
>>
>> Oh sorry, the data in {} is not actually empty; I hid it because it contains private information about my model. Do you still reach the same conclusion?
>>> On Aug 10, 2017, at 8:52 AM, Dawid Wysakowicz wrote:
>>>
>>> You are right, I won’t be able to reproduce this problem without data. One thing I can tell, though, is that I think the problem is indeed with the hashcode. Unfortunately I don’t know Gson, but one strange thing I noticed is the exception message: SharedBufferEntry(ValueTimeWrapper({}, 1502298303586, 0), [SharedBufferEdge(null, 1)], 1).
>>> The first {} is your event.toString, which seems odd, as if your event were empty.
>>>
>>> Generally speaking, as I understand it, this exception is thrown because the hashcode of your event changes during serialization, so the lookup in an internal temporary cache fails.
>>>
>>>> On 10 Aug 2017, at 14:29, Daiqing Li wrote:
>>>>
>>>> Hi,
>>>>
>>>> Here is the code. But I am not sure you can reproduce the problem without the data source.
>>>>
>>>> Best,
>>>> Daiqing
>>>>
>>>> On Thu, Aug 10, 2017 at 8:15 AM, Dawid Wysakowicz wrote:
>>>> As @Kostas asked in your previous thread, would it be possible for you to share your code for that job, or at least a minimal example that reproduces this behaviour? I fear we won’t be able to help you without any further info.
>>>>
>>>> Regards,
>>>> Dawid
>>>>
>>>>> On 10 Aug 2017, at 14:10, Daiqing Li wrote:
>>>>>
>>>>> Hi Flink user,
>>>>>
>>>>> I am using FsStateBackend and AWS EMR YARN to run a CEP job. I got this exception after running for a while. Could anyone give me some help debugging this? I tried parallelism 1, and it has the same problem. I also tried reimplementing the hashCode and equals methods. I use a UUID as the hashcode right now.
>>>>>
>>>>> 2017-08-09 18:15:04,572 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - KeyedCEPPatternOperator -> Map -> Sink: Unnamed (3/4) (d4749a4c3469732a2a5edf40b83f88d4) switched from RUNNING to FAILED.
>>>>> AsynchronousException{java.lang.Exception: Could not materialize checkpoint 946 for operator KeyedCEPPatternOperator -> Map -> Sink: Unnamed (3/4).}
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: java.lang.Exception: Could not materialize checkpoint 946 for operator KeyedCEPPatternOperator -> Map -> Sink: Unnamed (3/4).
>>>>>     ... 6 more
>>>>> Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Could not find id for entry: SharedBufferEntry(ValueTimeWrapper({}, 1502298303586, 0), [SharedBufferEdge(null, 1)], 1)
>>>>>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>>>>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>>>>     at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
>>>>>     ... 5 more
>>>>>     Suppressed: java.lang.Exception: Could not properly cancel managed keyed state future.
>>>>>         at org.apache.flink.streaming.api.operators.OperatorSnapshotResult.cancel(OperatorSnapshotResult.java:90)
>>>>>         at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.cleanup(StreamTask.java:1023)
>>>>>         at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:961)
>>>>>         ... 5 more
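
The failure mode discussed in this thread (an event whose hashcode changes after it has been registered in a hash-based map, so a later lookup cannot find it) can be illustrated with a small standalone sketch. This is not Flink's SharedBuffer code: MutableEvent, StableEvent and HashCodeDemo are hypothetical names, and the plain HashMap only stands in for the entryID map mentioned above. The second half shows one possible remedy, assuming the event can be given final fields so that equals and hashCode depend only on values that survive a copy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical event whose hashCode depends on a mutable field.
final class MutableEvent {
    String payload;

    MutableEvent(String payload) {
        this.payload = payload;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof MutableEvent
                && Objects.equals(payload, ((MutableEvent) o).payload);
    }

    @Override
    public int hashCode() {
        return Objects.hash(payload);
    }
}

// Hypothetical event with final fields: equal copies always hash alike.
final class StableEvent {
    private final String id;
    private final long timestamp;

    StableEvent(String id, long timestamp) {
        this.id = id;
        this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof StableEvent)) {
            return false;
        }
        StableEvent other = (StableEvent) o;
        return timestamp == other.timestamp && Objects.equals(id, other.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, timestamp);
    }
}

final class HashCodeDemo {
    // Registers an event, mutates it, then looks it up again.
    static boolean lookupSurvivesMutation() {
        Map<MutableEvent, Integer> ids = new HashMap<>();
        MutableEvent event = new MutableEvent("a");
        ids.put(event, 1);
        event.payload = "b"; // hashCode changes while the entry is in the map
        return ids.containsKey(event);
    }

    // Registers an event, then looks it up through an equal copy,
    // as would happen after a serialize/deserialize round trip.
    static boolean lookupSurvivesCopy() {
        Map<StableEvent, Integer> ids = new HashMap<>();
        ids.put(new StableEvent("evt-1", 1502298303586L), 1);
        StableEvent copy = new StableEvent("evt-1", 1502298303586L);
        return ids.containsKey(copy);
    }

    public static void main(String[] args) {
        System.out.println("mutable event found: " + lookupSurvivesMutation()); // false
        System.out.println("stable copy found:   " + lookupSurvivesCopy());     // true
    }
}
```

If a check like lookupSurvivesCopy fails for the real event class, that would be consistent with the "Could not find id for entry" error in the trace, although the sketch alone cannot confirm that this is the cause in this job.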