Subject: Re: Jobmanager HA with Rolling Sink in HDFS
From: Maximilian Bode <maximilian.bode@tngtech.com>
To: user@flink.apache.org
Date: Tue, 8 Mar 2016 13:19:46 +0100
In-Reply-To: <758CA569-6136-4E75-966F-A6AFCD21759E@apache.org>

Hi Aljoscha,

oh I see. I was under the impression this file was used internally and that the output would be completed at the end. OK, so I extracted the relevant lines using

for i in part-*; do head -c $(cat "_$i.valid-length" | strings) "$i" > "$i.final"; done

which seems to do the trick.

Unfortunately, now some records are missing again. In particular, there are the files
part-0-0, part-1-0, ..., part-10-0, part-11-0, each with corresponding .valid-length files
part-0-1, part-1-1, ..., part-10-0
in the bucket, where job parallelism=12. So it looks to us as if one of the files was not even created in the second attempt. This behavior seems to be somewhat reproducible, cf. my earlier email where the part-11 file disappeared as well.

Thanks again for your help.

Cheers,
Max

--
Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com
TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
Sitz: Unterföhring * Amtsgericht München * HRB 135082

> On 08.03.2016 at 11:05, Aljoscha Krettek <aljoscha@apache.org> wrote:
>
> Hi,
> are you taking the ".valid-length" files into account? The problem with doing "exactly-once" with HDFS is that before Hadoop 2.7 it was not possible to truncate files. So the trick we're using is to write the length up to which a file is valid if we would normally need to truncate it. (If the job fails in the middle of writing, the output files have to be truncated to a valid position.) For example, say you have an output file part-8-0. Now, if there exists a file part-8-0.valid-length, this file tells you up to which position the file part-8-0 is valid. So you should only read up to this point.
>
> The name of the ".valid-length" suffix can also be configured, by the way, as can all the other stuff.
>
> If this is not the problem then I definitely have to investigate further. I'll also look into the Hadoop 2.4.1 build problem.
>
> Cheers,
> Aljoscha
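[As a rough, hypothetical illustration of the ".valid-length" mechanism described above: the Java sketch below reads a part file from HDFS only up to the recorded valid position. This is not code from the thread; the "_" prefix and ".valid-length" suffix simply mirror the shell loop at the top of this mail, the digit-stripping mirrors the 'strings' call, and all class, method and path names are made up. Assuming the standard Hadoop FileSystem API:]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class ValidLengthReader {

    /** Copies at most the valid prefix of a part file into 'out'. */
    public static void copyValidPart(FileSystem fs, Path partFile, OutputStream out)
            throws IOException {
        Path validLengthFile = new Path(partFile.getParent(),
                "_" + partFile.getName() + ".valid-length");

        // If no valid-length file exists, the whole part file is assumed valid.
        long validLength = fs.getFileStatus(partFile).getLen();
        if (fs.exists(validLengthFile)) {
            validLength = readLength(fs, validLengthFile);
        }

        byte[] buffer = new byte[8192];
        long remaining = validLength;
        try (FSDataInputStream in = fs.open(partFile)) {
            while (remaining > 0) {
                int read = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
                if (read < 0) break;
                out.write(buffer, 0, read);
                remaining -= read;
            }
        }
    }

    /** The length file apparently holds the position as text plus a small binary
     *  prefix, so strip everything that is not a digit (the 'strings' trick). */
    private static long readLength(FileSystem fs, Path lengthFile) throws IOException {
        byte[] raw = new byte[(int) fs.getFileStatus(lengthFile).getLen()];
        try (FSDataInputStream in = fs.open(lengthFile)) {
            in.readFully(raw);
        }
        String digits = new String(raw, StandardCharsets.UTF_8).replaceAll("[^0-9]", "");
        return Long.parseLong(digits);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: write the valid prefix of one bucket file to "<part>.final".
        FileSystem fs = FileSystem.get(new Configuration());
        Path part = new Path("hdfs://our.machine.com:8020/hdfs/dir/outbound/part-8-0");
        try (OutputStream out = fs.create(new Path(part + ".final"))) {
            copyValidPart(fs, part, out);
        }
    }
}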
>> On 08 Mar 2016, at 10:26, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
>>
>> Hi Aljoscha,
>> thanks again for getting back to me. I built from your branch and the exception is not occurring anymore. The RollingSink state can be restored.
>>
>> Still, the exactly-once guarantee seems not to be fulfilled; there are always some extra records after killing either a task manager or the job manager. Do you have an idea where this behavior might be coming from? (I guess concrete numbers will not help greatly, as there are so many parameters influencing them. Still, in our test scenario, we produce 2 million records in a Kafka queue, but in the final output files there are on the order of 2.1 million records, so a 5% error. The job is running in a per-job YARN session with n=3, s=4 and a checkpointing interval of 10s.)
>>
>> On another (maybe unrelated) note: when I pulled your branch, the Travis build did not go through for -Dhadoop.version=2.4.1. I have not looked into this further as of now; is this one of the tests known to fail sometimes?
>>
>> Cheers,
>> Max
>> <travis.log>
>>
>> --
>> Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com
>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>>
>>> On 07.03.2016 at 17:20, Aljoscha Krettek <aljoscha@apache.org> wrote:
>>>
>>> Hi Maximilian,
>>> sorry for the delay, we were very busy with the release last week. I had a hunch about the problem, but I think I found a fix now. The problem is in snapshot restore. When restoring, the sink tries to clean up any files that were previously in progress. If Flink restores from the same snapshot twice in a row, it will try to clean up the leftover files twice, but they are not there anymore; this causes the exception.
>>>
>>> I have a fix in my branch: https://github.com/aljoscha/flink/tree/rolling-sink-fix
>>>
>>> Could you maybe try whether this solves your problem? Which version of Flink are you using? You would have to build from source to try it out. Alternatively, I could build it and put it onto a Maven snapshot repository for you to try it out.
>>>
>>> Cheers,
>>> Aljoscha
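[The actual fix is in the branch linked above. Purely as a hypothetical sketch of the defensive cleanup described in the previous message, assuming the Hadoop FileSystem API and made-up method names, the idea is to tolerate an in-progress file that an earlier restore already cleaned up:]

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

public class RestoreCleanupSketch {

    /**
     * Deletes a leftover in-progress file if it is still there. When the same
     * snapshot is restored a second time the file is already gone, and that is
     * fine; the old behavior surfaced this as "was neither moved to pending nor
     * is still in progress".
     */
    static void cleanUpLeftoverFile(FileSystem fs, Path inProgressFile) throws IOException {
        if (fs.exists(inProgressFile)) {
            fs.delete(inProgressFile, false);
        }
        // A missing file just means an earlier restore already cleaned it up,
        // so no exception is thrown here.
    }
}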
>>>> On 03 Mar 2016, at 14:50, Aljoscha Krettek <aljoscha@apache.org> wrote:
>>>>
>>>> Hi,
>>>> did you check whether there are any files at your specified HDFS output location? If yes, which files are there?
>>>>
>>>> Cheers,
>>>> Aljoscha
>>>>
>>>>> On 03 Mar 2016, at 14:29, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
>>>>>
>>>>> Just for the sake of completeness: this also happens when killing a task manager and is therefore probably unrelated to job manager HA.
>>>>>
>>>>>> On 03.03.2016 at 14:17, Maximilian Bode <maximilian.bode@tngtech.com> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> unfortunately, I am running into another problem trying to establish exactly-once guarantees (Kafka -> Flink 1.0.0-rc3 -> HDFS).
>>>>>>
>>>>>> When using
>>>>>>
>>>>>> RollingSink<Tuple3<Integer,Integer,String>> sink = new RollingSink<Tuple3<Integer,Integer,String>>("hdfs://our.machine.com:8020/hdfs/dir/outbound");
>>>>>> sink.setBucketer(new NonRollingBucketer());
>>>>>> output.addSink(sink);
>>>>>>
>>>>>> and then killing the job manager, the new job manager is unable to restore the old state, throwing
>>>>>> ---
>>>>>> java.lang.Exception: Could not restore checkpointed state to operators and functions
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:454)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:209)
>>>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>>>>>     at java.lang.Thread.run(Thread.java:744)
>>>>>> Caused by: java.lang.Exception: Failed to restore state to function: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>>>>     at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:168)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreState(StreamTask.java:446)
>>>>>>     ... 3 more
>>>>>> Caused by: java.lang.RuntimeException: In-Progress file hdfs://our.machine.com:8020/hdfs/dir/outbound/part-5-0 was neither moved to pending nor is still in progress.
>>>>>>     at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:686)
>>>>>>     at org.apache.flink.streaming.connectors.fs.RollingSink.restoreState(RollingSink.java:122)
>>>>>>     at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.restoreState(AbstractUdfStreamOperator.java:165)
>>>>>>     ... 4 more
>>>>>> ---
>>>>>> I found a resolved issue [1] concerning Hadoop 2.7.1. We are in fact using 2.4.0 – might this be the same issue?
>>>>>>
>>>>>> Another thing I could think of is that the job is not configured correctly and there is some sort of timing issue. The checkpoint interval is 10 seconds; everything else was left at the default values. Then again, as the NonRollingBucketer is used, there should not be any timing issues, right?
>>>>>>
>>>>>> Cheers,
>>>>>> Max
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-2979
>>>>>>
>>>>>> --
>>>>>> Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com
>>>>>> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
>>>>>> Geschäftsführer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
>>>>>> Sitz: Unterföhring * Amtsgericht München * HRB 135082
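[For readers reconstructing the setup discussed in this thread, here is a hypothetical end-to-end sketch: a Kafka source feeding a Flink 1.0 job with a 10 s checkpoint interval and a RollingSink on HDFS using the NonRollingBucketer. Only the sink construction is taken from the original message above; the topic name, Kafka properties, record parsing and class name are placeholders, not details from the thread.]

import java.util.Properties;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.NonRollingBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaToHdfsJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 10 s checkpoint interval as mentioned in the thread; other settings stay at defaults.
        env.enableCheckpointing(10000);

        // Placeholder Kafka 0.8 consumer configuration, not taken from the thread.
        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("zookeeper.connect", "zk:2181");
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
        kafkaProps.setProperty("group.id", "rolling-sink-test");

        DataStream<Tuple3<Integer, Integer, String>> output = env
                .addSource(new FlinkKafkaConsumer08<String>("records", new SimpleStringSchema(), kafkaProps))
                .map(new MapFunction<String, Tuple3<Integer, Integer, String>>() {
                    @Override
                    public Tuple3<Integer, Integer, String> map(String line) {
                        // Placeholder parsing; the real record format is not part of the thread.
                        String[] parts = line.split(",", 3);
                        return new Tuple3<Integer, Integer, String>(
                                Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), parts[2]);
                    }
                });

        // Sink construction as in the original message.
        RollingSink<Tuple3<Integer, Integer, String>> sink =
                new RollingSink<Tuple3<Integer, Integer, String>>("hdfs://our.machine.com:8020/hdfs/dir/outbound");
        sink.setBucketer(new NonRollingBucketer());
        output.addSink(sink);

        env.execute("Kafka to HDFS exactly-once test");
    }
}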
