Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
Content-Type: multipart/signed;
 boundary="Apple-Mail=_E001464D-7E6D-4A8F-8797-AD5B5931169B";
 protocol="application/pgp-signature"; micalg=pgp-sha256
Mime-Version: 1.0 (Mac OS X Mail 9.2 \(3112\))
Subject: Re: Jobmanager HA with Rolling Sink in HDFS
From: Maximilian Bode <maximilian.bode@tngtech.com>
In-Reply-To: <B6C823EC-4904-431F-852B-D55BF0597DB3@apache.org>
Date: Mon, 7 Mar 2016 17:36:53 +0100
Message-Id: <E7C753B0-09E1-433E-97DA-AD205DEAF184@tngtech.com>
References: <94248245-7C64-40E0-81A7-91B555398DC3@tngtech.com>
 <EB0FBEB9-9BB5-47FA-A8C8-4EC6DBDDB04D@tngtech.com>
 <FC44D61B-985B-4E16-9272-F99BF1F9E02E@apache.org>
 <B6C823EC-4904-431F-852B-D55BF0597DB3@apache.org>
To: user@flink.apache.org


--Apple-Mail=_E001464D-7E6D-4A8F-8797-AD5B5931169B
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_F58E3EC0-EEFE-4423-91B0-8AA62DC42FC6"


--Apple-Mail=_F58E3EC0-EEFE-4423-91B0-8AA62DC42FC6
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

Hi Aljoscha,

thank you very much. I will check whether this fixes the problem and get
back to you. I am using 1.0.0 as of today :)

Cheers,
 Max
--
Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Registered office: Unterföhring * Amtsgericht München * HRB 135082

> On 07 Mar 2016, at 17:20, Aljoscha Krettek <aljoscha@apache.org> wrote:
>
> Hi Maximilian,
> sorry for the delay, we were very busy with the release last week. I had a
> hunch about the problem, and I think I have found a fix now. The problem is
> in snapshot restore. When restoring, the sink tries to clean up any files
> that were previously in progress. If Flink restores to the same snapshot
> twice in a row, it tries to clean up the leftover files twice, but they are
> no longer there the second time, which causes the exception.
>
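
(Illustration of the failure mode described above. This is only a minimal sketch, not the actual RollingSink code: the class, the method names, and the file suffixes are hypothetical, and the HDFS directory is modelled as an in-memory set of file names. It assumes that the cleanup finalizes a leftover in-progress or pending file, and shows why a strict cleanup fails on a second restore from the same snapshot while a tolerant one does not.)

```java
import java.util.HashSet;
import java.util.Set;

public class RestoreCleanupSketch {
    // Hypothetical suffixes; the real sink's naming may differ.
    static final String IN_PROGRESS = ".in-progress";
    static final String PENDING = ".pending";

    /** Finalizes a leftover file if one exists; returns false if none was found. */
    static boolean finalizeLeftover(Set<String> dir, String part) {
        if (dir.remove(part + IN_PROGRESS) || dir.remove(part + PENDING)) {
            dir.add(part);
            return true;
        }
        return false;
    }

    /** Strict variant: treats a missing leftover as an error. */
    static void strictRestore(Set<String> dir, String part) {
        if (!finalizeLeftover(dir, part)) {
            throw new IllegalStateException("In-Progress file " + part
                    + " was neither moved to pending nor is still in progress.");
        }
    }

    /** Tolerant variant: a missing leftover means an earlier restore already cleaned up. */
    static void tolerantRestore(Set<String> dir, String part) {
        finalizeLeftover(dir, part);
    }

    public static void main(String[] args) {
        Set<String> dir = new HashSet<>();
        dir.add("part-5-0" + IN_PROGRESS);

        // Restoring twice from the same snapshot is harmless with the tolerant variant.
        tolerantRestore(dir, "part-5-0");
        tolerantRestore(dir, "part-5-0");
        System.out.println(dir);
    }
}
```

With the tolerant variant, the second restore is a no-op and only the finalized part-5-0 remains in the set. Replacing the second call with strictRestore raises the exception, which matches the behaviour reported in this thread.
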
> I have a fix in my branch:
> https://github.com/aljoscha/flink/tree/rolling-sink-fix
>
> Could you check whether this solves your problem? Which version of Flink
> are you using? You would have to build from source to try it out.
> Alternatively, I could build it and publish it to a Maven snapshot
> repository for you to try.
>
> Cheers,
> Aljoscha
>> On 03 Mar 2016, at 14:50, Aljoscha Krettek <aljoscha@apache.org> wrote:
>>
>> Hi,
>> did you check whether there are any files at your specified HDFS output
>> location? If so, which files are there?
>>
>> Cheers,
>> Aljoscha
>>> On 03 Mar 2016, at 14:29, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
>>>
>>> Just for the sake of completeness: this also happens when killing a task
>>> manager and is therefore probably unrelated to job manager HA.
>>>
>>>> On 03 Mar 2016, at 14:17, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> unfortunately, I am running into another problem while trying to establish
>>>> exactly-once guarantees (Kafka -> Flink 1.0.0-rc3 -> HDFS).
>>>>
>>>> When using
>>>>
>>>> RollingSink<Tuple3<Integer,Integer,String>> sink =
>>>>     new RollingSink<Tuple3<Integer,Integer,String>>("hdfs://our.machine.com:8020/hdfs/dir/outbound");
>>>> sink.setBucketer(new NonRollingBucketer());
>>>> output.addSink(sink);
>>>>
>>>> and then killing the job manager, the new job manager is unable to restore
>>>> the old state, throwing
>>>> ---
>>>> java.lang.Exception: Could not restore checkpointed state to operators and functions
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:454)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:209)
>>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>>> 	at java.lang.Thread.run(Thread.java:744)
>>>> Caused by: java.lang.Exception: Failed to restore state to function: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>> 	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:168)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:446)
>>>> 	... 3 more
>>>> Caused by: java.lang.RuntimeException: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>> 	at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:686)
>>>> 	at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:122)
>>>> 	at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:165)
>>>> 	... 4 more
>>>> ---
>>>> I found a resolved issue [1] concerning Hadoop 2.7.1. We are in fact using
>>>> 2.4.0; might this be the same issue?
>>>>
>>>> Another thing I could think of is that the job is not configured correctly
>>>> and there is some sort of timing issue. The checkpoint interval is 10
>>>> seconds; everything else was left at its default value. Then again, as the
>>>> NonRollingBucketer is used, there should not be any timing issues, right?
>>>>
>>>> Cheers,
>>>> Max
>>>>
>>>> [1] https://issues.apache.org/jira/browse/FLINK-2979
>>>>


--Apple-Mail=_F58E3EC0-EEFE-4423-91B0-8AA62DC42FC6--

--Apple-Mail=_E001464D-7E6D-4A8F-8797-AD5B5931169B
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename=signature.asc
Content-Type: application/pgp-signature;
	name=signature.asc
Content-Description: Message signed with OpenPGP using GPGMail

-----BEGIN PGP SIGNATURE-----

iQEcBAEBCAAGBQJW3a4lAAoJEORguq51JMZaGMkIALJdkNI8Gls4nJZL/KREsC5B
NMgFMvYH5IUHn96uFWut99J0Ol9mBvpxBK6DwHgpOiUTTmrrEnd2oxkgbZ3bPlBo
HZCPeH77WooNZVhuCo0Bb4zdpFeTJcxGpm48FbeN+Ovo2bUkSeFUeDISG7D8RDyr
yLhVFWANXzHTvuY6Q11RFBhXhURHVzFA4EsRk1bKnIlfXJRHPISzLaqbuQjJpHfE
cvCnrmh6J6mXV8sre6/9Iu5qpZ/ZAECjhNSI7PaHjtKIKx/yk9UAvjcYkiRkds/c
Mitx/4fKVhoHqHuxjEbZOfVYpseYMcH2uHNJYnwu84G0z4jD7nfLxPRxfBsO2nQ=
=BEau
-----END PGP SIGNATURE-----

--Apple-Mail=_E001464D-7E6D-4A8F-8797-AD5B5931169B--