From: Fabian Hueske <fhueske@gmail.com>
Date: Tue, 5 Jun 2018 12:00:18 +0200
Subject: Re: NPE in flink sql over-window
To: "Yan Zhou [FDS Science]" <yzhou@coupang.com>
Cc: Dawid Wysakowicz <dwysakowicz@apache.org>, user <user@flink.apache.org>
Mailing-List: user@flink.apache.org
Hi Yan,

Thanks for providing the logs and opening the JIRA issue!
Let's continue the discussion there.

Best, Fabian

2018-06-05 1:26 GMT+02:00 Yan Zhou [FDS Science]:

> Hi Fabian,
>
> I added some trace logs in ProcTimeBoundedRangeOver and think it should be
> a bug. The logs below show how CLEANUP_TIME_1 bypasses the
> needToCleanupState check and causes the NPE. A JIRA ticket [1] has been
> created.
>
> Best
> Yan
>
> [ts:1528149296456] [label:state_ttl_update] register for cleanup at
> 1528150096456 (CLEANUP_TIME_1), because of Row:(orderId:001,userId:U123)
> [ts:1528149296456] [label:register_pt] register for process input at
> 1528149296457, because of Row:(orderId:001,userId:U123)
> [ts:1528149296458] [label:state_apply] ontimer at 1528149296457, apply
> Row:(orderId:001,userId:U123) to accumulator
>
> [ts:1528149885813] [label:state_ttl_update] register for cleanup at
> 1528150685813 (CLEANUP_TIME_2), because of Row:(orderId:002,userId:U123)
> [ts:1528149885813] [label:register_pt] register for process input at
> 1528149885814, because of Row:(orderId:002,userId:U123)
> [ts:1528149885814] [label:state_apply] ontimer at 1528149885814, apply
> Row:(orderId:002,userId:U123) to accumulator
>
> [ts:1528150096460] [label:NO_ELEMENTS_IN_STATE] ontimer at
> 1528150096456 (CLEANUP_TIME_1), bypasses the needToCleanupState check,
> however rowMapState is {key:1528150096455, value:[]}
>
> [ts:1528150685815] [label:state_timeout] ontimer at
> 1528150685813 (CLEANUP_TIME_2), clean/empty the rowMapState
> [{key:1528149885813, value:[Row:(orderId:002,userId:U123)]}]
>
> [1]: https://issues.apache.org/jira/browse/FLINK-9524
>
> ------------------------------
> From: Yan Zhou [FDS Science]
> Sent: Monday, June 4, 2018 4:05 PM
> To: Fabian Hueske
> Cc: Dawid Wysakowicz; user
> Subject: Re: NPE in flink sql over-window
>
> Hi Fabian,
>
> Yes, the NPE caused the job failure and recovery (instead of being the
> result of a recovery). And during the recovery, the same exception is
> thrown from the same line.
>
> Best
> Yan
>
> ------------------------------
> From: Fabian Hueske
> Sent: Thursday, May 31, 2018 12:09:03 AM
> To: Yan Zhou [FDS Science]
> Cc: Dawid Wysakowicz; user
> Subject: Re: NPE in flink sql over-window
>
> Hi Yan,
>
> Thanks for the details and for digging into the issue.
> If I got it right, the NPE caused the job failure and recovery (instead
> of being the result of a recovery), correct?
>
> Best, Fabian
>
> 2018-05-31 7:00 GMT+02:00 Yan Zhou [FDS Science]:
>
> Thanks for the reply.
>
> Yes, it only happens if I configure the idle state retention times. The
> error occurs the first time, before the first recovery. On Flink 1.4.x I
> ran with rowtime rather than proctime, so I am not sure whether proctime
> would cause the same problem there.
>
> I am adding some trace logs to ProcTimeBoundedRangeOver. I will report
> my test results and file a JIRA after that.
>
> Best
> Yan
>
> ------------------------------
> From: Fabian Hueske
> Sent: Wednesday, May 30, 2018 1:43:01 AM
> To: Dawid Wysakowicz
> Cc: user
> Subject: Re: NPE in flink sql over-window
>
> Hi,
>
> Dawid's analysis is certainly correct, but looking at the code this
> should not happen.
>
> I have a few questions:
> - You said this only happens if you configure idle state retention
>   times, right?
> - Does the error occur the first time, without a previous recovery?
> - Did you run the same query on Flink 1.4.x without any problems?
>
> Thanks, Fabian
>
> 2018-05-30 9:25 GMT+02:00 Dawid Wysakowicz:
>
> Hi Yan,
>
> I think it is a bug in ProcTimeBoundedRangeOver. It tries to access a
> list of elements that was already cleared and does not check against
> null. Could you please file a JIRA for that?
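Dawid's diagnosis above — the operator fetches a list from keyed state that was already cleared and iterates it without a guard — can be modeled with a small, Flink-free sketch. The names here (`row_map_state`, `on_timer_*`) are stand-ins for Flink's `MapState` and `onTimer`, not actual Flink API:

```python
# Minimal model of the failure mode: a processing-time timer fires after
# the state it refers to was cleaned up, and the timer callback iterates
# the fetched list without checking for a missing entry.

row_map_state = {}  # models MapState[Long, List[Row]]: timestamp -> rows

def on_timer_unsafe(timestamp):
    # dict.get on a missing key returns None, like MapState.get in Java
    rows = row_map_state.get(timestamp - 1)
    count = 0
    for row in rows:  # raises TypeError here, analogous to the NPE in onTimer
        count += 1
    return count

def on_timer_safe(timestamp):
    rows = row_map_state.get(timestamp - 1)
    if rows is None:  # the defensive check the unsafe version is missing
        return 0
    return len(rows)

# An element arrives, then idle-state cleanup wipes the map ...
row_map_state[1528149296456] = ["Row:(orderId:001,userId:U123)"]
row_map_state.clear()

# ... and a previously registered timer still fires for that element.
try:
    on_timer_unsafe(1528149296457)
    crashed = False
except TypeError:
    crashed = True  # the Python analogue of the NullPointerException

print(crashed, on_timer_safe(1528149296457))  # prints: True 0
```

The sketch only illustrates why a null check (or an early return when the entry is absent) would mask the symptom; the underlying question in the thread is why a cleanup timer fires for state that a later element should have kept alive.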
> Best,
> Dawid
>
> On 30/05/18 08:27, Yan Zhou [FDS Science] wrote:
>
> I also get a warning that the CodeCache is full around that time. It is
> printed by the JVM and has no timestamp. I suspect it is caused by the
> many failure recoveries from checkpoints, since the SQL queries are
> dynamically compiled too many times.
>
> Java HotSpot(TM) 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
> Java HotSpot(TM) 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
> CodeCache: size=245760Kb used=244114Kb max_used=244146Kb free=1645Kb
> bounds [0x00007fa4fd000000, 0x00007fa50c000000, 0x00007fa50c000000]
> total_blobs=54308 nmethods=53551 adapters=617
> compilation: disabled (not enough contiguous free space left)
>
> ------------------------------
> From: Yan Zhou [FDS Science]
> Sent: Tuesday, May 29, 2018 10:52:18 PM
> To: user@flink.apache.org
> Subject: NPE in flink sql over-window
>
> Hi,
>
> I am using Flink SQL 1.5.0. My application throws an NPE, and after it
> recovers from a checkpoint automatically, it immediately throws the NPE
> again from the same line of code.
>
> My application reads messages from Kafka, converts the DataStream into a
> table, issues an over-window aggregation, and writes the result into a
> sink. The NPE is thrown from the class ProcTimeBoundedRangeOver; please
> see the exception log at the bottom.
>
> The exception always happens maxIdleStateRetentionTime after the
> application starts. What could be the possible causes?
> Best
>
> Yan
>
> 2018-05-27 11:03:37,656 INFO  org.apache.flink.runtime.taskmanager.Task - over: (PARTITION BY: uid, ORDER BY: proctime, RANGEBETWEEN 86400000 PRECEDING AND CURRENT ROW, select: (id, uid, proctime, group_concat($7) AS w0$o0)) -> select: (id, uid, proctime, w0$o0 AS EXPR$3) -> to: Row -> Flat Map -> Filter -> Sink: Unnamed (3/15) (327efe96243bbfdf1f1e40a3372f64aa) switched from RUNNING to FAILED.
> TimerException{java.lang.NullPointerException}
>     at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:284)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.flink.table.runtime.aggregate.ProcTimeBoundedRangeOverWithLog.onTimer(ProcTimeBoundedRangeOver.scala:181)
>     at org.apache.flink.streaming.api.operators.LegacyKeyedProcessOperator.invokeUserFunction(LegacyKeyedProcessOperator.java:97)
>     at org.apache.flink.streaming.api.operators.LegacyKeyedProcessOperator.onProcessingTime(LegacyKeyedProcessOperator.java:81)
>     at org.apache.flink.streaming.api.operators.HeapInternalTimerService.onProcessingTime(HeapInternalTimerService.java:266)
>     at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:281)
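On the CodeCache warning quoted earlier in the thread: one common mitigation is to reserve a larger code cache for the TaskManager JVMs, since every job (re)start compiles the generated SQL operators again. A possible flink-conf.yaml sketch — the 512m value is an illustrative guess, not a recommendation made anywhere in this thread:

```yaml
# flink-conf.yaml: pass a larger code cache to the Flink JVMs.
# Repeated checkpoint recoveries recompile the generated SQL operators,
# which can exhaust the JVM's default code cache (~240 MB on HotSpot 8).
env.java.opts: "-XX:ReservedCodeCacheSize=512m"
```

This only treats the symptom reported in the warning; the NPE itself is tracked in FLINK-9524.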